APPEAL BRIEF

Sir: 

Appellants hereby submit an original and two copies of this Appeal Brief to the Board of Patent
Appeals and Interferences ("the Board") in response to the Final Office Action mailed on May 20, 2003.
The Notice of Appeal was timely submitted on September 22, 2003, and was received in the Patent and
Trademark Office ("the Office") on September 26, 2003. This Appeal Brief is timely submitted in light of
the concurrently filed Petition for an Extension of Time of four months to and including March 26, 2004
and authorization to deduct the fee as required under 37 C.F.R. § 1.17(a)(2) from Appellants'
Representatives' deposit account. The Commissioner is also authorized to charge the fee for filing this
Appeal Brief ($165.00), as required under 37 C.F.R. § 1.17(c), to Lexicon Genetics Incorporated Deposit
Account No. 50-0892.

Appellants believe no fees in addition to the fee for filing the Appeal Brief and the fee for the
extension of time are due in connection with this Appeal Brief. However, should any additional fees under
37 C.F.R. §§ 1.16 to 1.21 be required for any reason related to this communication, the Commissioner
is authorized to charge any underpayment or credit any overpayment to Lexicon Genetics Incorporated
Deposit Account No. 50-0892.

I. REAL PARTY IN INTEREST 

The real party in interest is the Assignee, Lexicon Genetics Incorporated, 8800 Technology Forest 
Place, The Woodlands, Texas, 77381. 

II. RELATED APPEALS AND INTERFERENCES 

Appellants know of no related appeals or interferences. 

III. STATUS OF THE CLAIMS

The present application was filed on March 20, 2001, claiming the benefit of U.S. Provisional
Application Numbers 60/190,638, 60/191,188, and 60/193,639, which were filed on March 20, 2000,
March 22, 2000, and March 31, 2000, respectively, and included original claims 1-10. A Restriction and
Election Requirement was issued by the Office on August 16, 2002, restricting the original claims into three
separate and distinct inventions. In a response to the Restriction Requirement, submitted to the Office on
September 16, 2001, Appellants elected without traverse the Group I invention (comprising original claims
1-4) for prosecution on the merits and as a result claims 5-10 were canceled without prejudice and
disclaimer as being drawn to non-elected inventions. In a First Official Action, issued on October 22,
2002 ("the First Action"), the Examiner objected to the title of the specification and rejected claims 1-4
under 35 U.S.C. § 112, first paragraph, allegedly due to a lack of written description. Claim 1 was also
rejected under 35 U.S.C. § 112, second paragraph, as being allegedly indefinite for recitation of the phrase
"sequence first disclosed in SEQ ID NO:1". Claim 2 was also rejected under 35 U.S.C. § 112, second
paragraph, as being allegedly indefinite for recitation of the phrase "stringent conditions". In addition, claims
1-4 were also rejected under 35 U.S.C. § 101, due to the alleged lack of patentable utility, and under
35 U.S.C. § 112, first paragraph, as allegedly unusable by the skilled artisan due to the alleged lack of
patentable utility. In Appellants' response to the First Official Action, submitted to the Office on February
12, 2003 ("response to the First Action"), Appellants amended the title of the specification, amended Claim 1
to further improve its clarity, and added new claims 11-12 to more particularly point out and distinctly claim
the present invention. A Second and Final Official Action was issued on May 20, 2003 (the "Final
Action"), in which it was stated that "The rejections and/or objections made in the prior office action, which
are not explicitly stated below, in original or modified form are withdrawn" (page 2). Therefore the objection
to the title and the rejections of claims 1 and 2 under 35 U.S.C. § 112, second paragraph, as being allegedly
indefinite were withdrawn. The pending rejections of claims 1-4 under 35 U.S.C. § 101, and under
35 U.S.C. § 112, first paragraph, due to the alleged lack of patentable utility were maintained. Additionally,
claim 11 was rejected under 35 U.S.C. § 112, second paragraph, as being allegedly indefinite for recitation
of the phrase "comprising [a] nucleic acid sequence of Claim 4." In a
response to the Final Action, submitted on September 22, 2003 ("response to the Final Action"), an
amendment to claim 11 was submitted and Appellants again addressed the outstanding rejections of claims
1-4, 11 and 12. A Notice of Appeal was also filed on September 22, 2003 and received by the
U.S.P.T.O. on September 26, 2003. On December 5, 2003, Appellants received an atypical message
from the Examiner that "the current After Final was under consideration for allowable subject matter".
This message was followed by a confirming interview summary (Paper No. 20031201). As Appellants
believed that the case contained allowable subject matter, based on the information in the Examiner's
telephone message and the confirming Interview summary supporting this position, Appellants did not
immediately submit an Appeal Brief. Following several unanswered telephone communications, Appellants
reached the Examiner during the week of February 23, 2004, at which time they were told that the
Examiner did not have ready access to the case, but that he would investigate. Appellants left several more
telephone messages during the week of March 15, 2004 and eventually received a return call in which the
Examiner represented that he had the case and was addressing the issue. During the week of March 22,
2004 Appellants called the Examiner multiple times to determine the status of the case. On March 22,
2004 Appellants were told that the Examiner was prepared to discuss the case with his Supervisory Patent
Examiner the next day. Appellants left additional messages and, following a message on March 25, 2004,
the Examiner left a return message that the case contained no allowable material. The Examiner alleged
that he would mail out an Advisory Action shortly. Appellants note that at no time did the Examiner
initiate contact to correct the misperception that the case contained allowable material. As Appellants had
already lost potential patent term and accumulated costs in extension fees, and faced a 24 hour deadline
for additional fees, and because Appellants had no way of knowing what said Advisory Action might say,
when and if it arrived, the present Appeal Brief and a 4 month extension of time have been filed. Thus, this
Appeal Brief is based on Appellants' last official written communication with the Office. A copy of the
appealed claims (as of the Final Office Action) is included below in the Appendix (Section IX).

IV. STATUS OF THE AMENDMENTS 

Appellants filed a response to the Final Office Action on September 22, 2003 that contained
amendments to the claims. As at this time Appellants have received no Advisory Action entering these
amendments into the case, Appellants must assume that these proposed amendments were not entered in
the case and are therefore outstanding.



V. SUMMARY OF THE INVENTION 

The present invention relates to Appellants' discovery and identification of novel human
polynucleotide and amino acid sequences that encode a novel human semaphorin protein (specification
at or about page 2, lines 5-8 and 14-15; page 4, line 10 and page 17, lines 10 and 14-18). Semaphorins
are a class of molecules with recognized function and utility having been implicated in mediating neural
processes, cancer, and development. The semaphorin of the present invention was shown to be expressed
in human fetal brain, brain, cerebellum, thymus, spleen, lymph node, kidney, uterus, adipose, esophagus,
cervix, rectum, pericardium, and placenta (specification at or about page 4, lines 12-15) and those of skill
in the art recognize that semaphorins are known to act to regulate the organization and fasciculation of
nerves in the body (specification at or about page 2, lines 5-8). Thus the sequences of the present invention
encode a molecule with specific, substantial and well-established function and utility. Additional uses
described in the specification include assessing temporal and tissue specific gene expression patterns
(specification at or about page 7, line 23), particularly using a high throughput "chip" format (specification
at page 6, line 29 through page 9), mapping the sequences to a specific region of a human chromosome
and identifying protein encoding regions (specification at or about page 3, line 11), determining the genomic
structure (specification at or about page 12, line 4), identifying verified intron/exon splice junctions
(specification at or about page 12, lines 5-10) and in diagnostic assays such as forensic analysis, human
population biology and paternity determinations (see, for example, the specification at or about page 9, line
7; page 12, line 5 and page 18, line 11).

VI. ISSUES ON APPEAL 

1. Do claims 1-4, 11 and 12 lack a patentable utility? 

2. Are claims 1-4, 11 and 12 unusable by a skilled artisan due to a lack of patentable utility?

3. Is claim 11 indefinite?

VII. GROUPING OF THE CLAIMS

For the purposes of the outstanding rejections under 35 U.S.C. § 101 and 35 U.S.C. § 112, first
paragraph, the claims will stand or fall together. The rejection of claim 11 under 35 U.S.C. § 112, second
paragraph for allegedly being indefinite will stand or fall alone.

VIII. ARGUMENT

A. Do Claims 1-4, 11 and 12 Lack a Patentable Utility?

The Final Action first rejects claims 1-4, 11 and 12 under 35 U.S.C. § 101, as allegedly lacking
a patentable utility due to not being supported by either a specific and substantial utility or a well-established
utility. Appellants strongly disagree.

Appellants respectfully submit that the question of utility is a straightforward one as established by
the courts. As set forth by the Federal Circuit, "(t)he threshold of utility is not high: An invention is 'useful'
under section 101 if it is capable of providing some identifiable benefit." Juicy Whip Inc. v. Orange Bang
Inc., 51 USPQ2d 1700 (Fed. Cir. 1999) (citing Brenner v. Manson, 383 U.S. 519, 534 (1966)).
Additionally, the Federal Circuit has stated that "(t)o violate § 101 the claimed device must be totally
incapable of achieving a useful result." Brooktree Corp. v. Advanced Micro Devices, Inc., 977 F.2d
1555, 1571 (Fed. Cir. 1992), emphasis added. Cross v. Iizuka (224 USPQ 739 (Fed. Cir. 1985);
"Cross") states "any utility of the claimed compounds is sufficient to satisfy 35 U.S.C. § 101". Cross at
748, emphasis added. Indeed, the Federal Circuit recently emphatically confirmed that "anything under 
the sun that is made by man" is patentable (State Street Bank & Trust Co. v. Signature Financial Group 
Inc., 47 USPQ2d 1596, 1600 (Fed. Cir. 1998), citing the U.S. Supreme Court's decision in Diamond
v. Chakrabarty, 206 USPQ 193 (S.Ct. 1980)).

The legal test for utility simply involves an assessment of whether those skilled in the art would find 
any of the utilities described for the invention to be credible or believable. According to the Examination 
Guidelines for the Utility Requirement, if the applicant has asserted that the claimed invention is useful for 
any particular purpose (i.e., it has a "specific and substantial utility") and the assertion would be considered 
credible by a person of ordinary skill in the art, the Examiner should not impose a rejection based on lack
of utility (66 Federal Register 1098, January 5, 2001).

In In re Brana, (34 USPQ2d 1436 (Fed. Cir. 1995), "Brana"), the Federal Circuit admonished
the P.T.O. for confusing "the requirements under the law for obtaining a patent with the requirements for
obtaining government approval to market a particular drug for human consumption". Brana at 1442. The
Federal Circuit went on to state:

At issue in this case is an important question of the legal constraints on patent office 
examination practice and policy. The question is, with regard to pharmaceutical inventions, 
what must the applicant provide regarding the practical utility or usefulness of the invention 
for which patent protection is sought. This is not a new issue: it is one which we would
have thought had been settled by case law years ago.

Brana at 1439, emphasis added. The choice of the phrase "utility or usefulness" in the foregoing quotation
is highly pertinent. The Federal Circuit is evidently using "utility" to refer to rejections under
35 U.S.C. § 101, and is using "usefulness" to refer to rejections under 35 U.S.C. § 112, first paragraph.
This is made evident in the continuing text in Brana, which explains the correlation between 35 U.S.C.
§§ 101 and 112, first paragraph. The Federal Circuit concluded:

FDA approval, however, is not a prerequisite for finding a compound useful within the 
meaning of the patent laws. Usefulness in patent law, and in particular in the context of 
pharmaceutical inventions, necessarily includes the expectation of further research and 
development. The stage at which an invention in this field becomes useful is well before 
it is ready to be administered to humans. Were we to require Phase II testing in order to
prove utility, the associated costs would prevent many companies from obtaining patent
protection on promising new inventions, thereby eliminating an incentive to pursue, through 
research and development, potential cures in many crucial areas such as the treatment of 
cancer. 

Brana at 1442-1443, citations omitted. In assessing the question of whether undue experimentation
would be required in order to practice the claimed invention, the key term is "undue", not
"experimentation". In re Angstadt and Griffin, 190 USPQ 214 (C.C.P.A. 1976). The need for
some experimentation does not render the claimed invention unpatentable. Indeed, a considerable
amount of experimentation may be permissible if such experimentation is routinely practiced in the art.
In re Angstadt and Griffin, supra; Amgen, Inc. v. Chugai Pharmaceutical Co., Ltd., 18 USPQ2d
1016 (Fed. Cir. 1991). As a matter of law, it is well settled that a patent need not disclose what is well
known in the art. In re Wands, 8 USPQ2d 1400 (Fed. Cir. 1988).

Even under the newly installed utility guidelines, Appellants note that MPEP 2107 (II)(B)(1)
states:

(1) If the applicant has asserted that the claimed invention is useful for any particular practical 
purpose (i.e., it has a "specific and substantial utility") and the assertion would be considered 
credible by a person of ordinary skill in the art, do not impose a rejection based on lack of 
utility. (MPEP 2107 (II)(B)(1)) 

Presented in the First Official Action, and maintained in the Final Action, was the Examiner's 
position that the specification does not disclose a specific and substantial or well-established utility for 
the claimed invention. Appellants strongly disagree and note that in the specification (at or about page 
2, lines 5-8 and 14-15; page 4, line 10 and page 17, lines 10 and 14-18) it was asserted that the 
sequences of the present invention encode a human semaphorin protein that is structurally similar to 
other known semaphorins. These statements in the specification assert that the sequences of the 
present invention and known semaphorins share a similarity in structure, a similarity in function and a 
similarity in biological function. This would be accepted by those of skill in the art, as it is generally 
recognized that there is a structure-function relationship. Thus clearly the sequences of the present
invention have patentable utility and the pending rejections under 35 U.S.C. § 101 and 35 U.S.C. § 112,
first paragraph should be withdrawn.

First, as set forth in the response to the First Action, reiterated in the response to the Final
Action, but never addressed in any of the Actions, Appellants would like to invite the Board's attention
to the fact that a sequence that is 99.872% identical over its entire length and which encompasses 89.5%
(782/874 of the amino acids) of the full length of the described sequence (SEQ ID NO:3) is present
in the leading scientific repository for biological sequence data (GenBank), and has been annotated by
third party scientists wholly unaffiliated with Appellants as encoding semaphorin sem2 [Homo
sapiens] (GenBank accession no. BAA98132; alignment and information previously provided and as
Exhibit A). Also previously submitted was evidence in the form of a nucleic acid comparison
between SEQ ID NO:3 and GenBank accession no. AB029496.1 (alignment and information
previously provided and as Exhibit B), identified as Homo sapiens mRNA for semaphorin sem2.
Thus clearly the identity between the sequences of the present invention and human semaphorin sem2
also exists at the nucleic acid level. Furthermore, also previously submitted in the response to the First
Action are the results of a nucleic acid sequence comparison between SEQ ID NO:1 and SEQ ID
NO:3 of the present invention, clearly indicating that SEQ ID NO:1 (see information previously
provided and as Exhibit C comparing SEQ ID NOS: 3 and 1) identifies a longer isoform of the present
invention, which is clearly encoded by the same genetic locus. Both the molecules described in SEQ
ID NOS: 3 and 1 contain the recognized semaphorin signaling domain present in human semaphorin
sem2 and the sema domains which are known to those of skill in the art to occur in semaphorins. Thus
these molecules contain all the functional structure and domains required for them to function as a
signaling semaphorin. Without doubt, those of skill in the art would readily recognize the sequences of
the present invention as encoding a functioning human semaphorin.

Additionally, Appellants respectfully submit that human semaphorins are well known to those of
skill in the art. Semaphorins have recognized function and utility, having been implicated in mediating
neural processes, and those of skill in the art recognize that semaphorins are known (as represented in
the specification at or about page 2, lines 5-8) to act to regulate the organization and fasciculation of
nerves in the body. Thus the sequences of the present invention encode a molecule with specific,
substantial and well-established function and utility.

Clearly those of skill in the art would recognize the sequences of the present invention as
encoding a human semaphorin. As evidenced by the review article entitled "Molecular Mechanisms of
Axonal Guidance" from the prestigious journal Science (298:1959-1964, 2002 and erratum; document
previously provided and as Exhibit D), semaphorins are well known to those of skill in the art as
soluble and membrane-bound proteins that act as chemorepulsive factors in neuronal development,
thereby playing a crucial role in axon guidance. Semaphorins, such as the one described in the present
invention, provide guidance for neuronal growth. In the second paragraph of section 5.1 of the
specification as filed, it is stated that "Because of their role in neural development, semaphorins have
been subject to considerable scientific scrutiny. For example, U.S. Patents Nos. 5,981,222 and
5,935,865, both of which are herein incorporated by reference, describe other semaphorins as well as
applications, utilities". Therefore, clearly, there can be no question that Appellants' asserted identity
and utility for the described sequences as a semaphorin is "credible." In addition, those of skill in the art
in the biomedical and pharmaceutical industry would readily recognize the utility for semaphorins and
their application to medical conditions requiring nerve regeneration, for example, the regeneration and
repair of nerve tissue following the surgical attachment of severed limbs or the resection of diseased
tissue, as well as nerve repair following a stroke. The specification details tissues in which these
sequences are expressed (human fetal brain, brain, cerebellum and others at or about page 4, lines 12-15)
and disease associations, both of which are consistent with the evidence provided and asserted
utility.

Thus Appellants have provided evidence that the sequences of the present invention which
were asserted in the specification to encode novel human semaphorin proteins do indeed encode the
human semaphorin proteins (specifically longer isoforms of sem2). This evidence includes sequence
identity, tissue expression, disease association and, below, genetic mapping to the same loci.
Therefore, clearly, there can be no question that Appellants' asserted utility for the described sequences
is "credible" and that those of skill in the art would recognize that the sequences of the present invention
encode a semaphorin protein, more particularly isoforms of sem2, and have all the recognized uses
thereof. In contrast, the Examiner has provided no evidence of record indicating that those of skill in
the art would not recognize that the sequences of the present invention encode semaphorin proteins. As
such, the scientific evidence clearly establishes that Appellants have described an invention having a
specific, substantial and well-established utility and whose utility is in full compliance with the provisions
of 35 U.S.C. § 101, and the Examiner's rejection should be overturned.

Furthermore, Appellants respectfully submit that the Examiner's position, in light of the evidence
provided, runs contrary to Example 10 of the PTO's Revised Interim Utility Guidelines Training
Materials (pages 53-55), which establishes that a rejection under 35 U.S.C. § 101 as allegedly lacking
a patentable utility and under 35 U.S.C. § 112, first paragraph as allegedly unusable by the skilled
artisan due to the alleged lack of patentable utility, is not proper when there is no reason to doubt the
asserted utility of a full length sequence (such as the presently claimed sequence) that has a high degree
of similarity to a protein having a known function. In the Analysis portion of Example 10 it states that
"Based on applicant's disclosure and the results of the PTO search, there is no reason to doubt the
assertion that SEQ ID NO:2 encodes a DNA ligase. Further DNA ligases have a well-established use
in the molecular biology art based on this class of proteins ability to ligate DNA ... Note that if there is
a well-established utility already associated with the claimed invention, the utility need not be asserted
in the specification as filed ... Thus the conclusion reached from this analysis is that a 35 U.S.C. § 101
and a 35 U.S.C. § 112 first paragraph, utility rejection should not be made."

In the present case, clearly the evidence supports Appellants' assertions that the sequences of
the present invention encode human semaphorin proteins (specifically isoforms of sem2), a class of
proteins for which there is a well established utility that is recognized by those of skill in the art and a
specific semaphorin whose function is known to those of skill in the art. Thus the present case is
identical to that presented in Example 10 of the Revised Interim Utility Guidelines Training Materials
(pages 53-55). In the present case it is clear that the sequences of the present invention encode human
semaphorin proteins (specifically isoforms of sem2). The Examiner dismisses Appellants' continued
assertions and the evidence provided that the proteins of the present invention are semaphorins
(specifically isoforms of sem2) and that the function of semaphorins as a class of proteins is well known
to those of skill in the art. Thus, according to the guidelines the conclusion reached from this
analysis is that a 35 U.S.C. § 101 and a 35 U.S.C. § 112 first paragraph, utility rejection should not
have been made. Thus the rejection of the presently claimed invention under a 35 U.S.C. § 101 and a
35 U.S.C. § 112 first paragraph utility rejection should be overruled.

The First Action (and maintained in the Final Action) takes issue with the fact that the
specification discloses no data for any activity of the present invention and that there are no working
examples. Indicating a need for such information is misplaced. It has long been established that "there
is no statutory requirement for the disclosure of a specific example". In re Gay, 135 USPQ 311
(C.C.P.A. 1962). The Actions also assume the position that structural homology cannot be accepted in
the absence of supporting evidence, because the relevant literature acknowledges that function cannot
be based solely on structural similarity to a protein found in the sequence database. In support of this
position the Final Action cites Bork (Genome Research 10:398-400, 2000) as supporting the
proposition that prediction of protein function from homology information is somewhat unpredictable. It
is of interest that in his "analysis" Bork often uses citations to many of his own previous publications, an
interesting approach: 'My position is supported by my previous disclosures of my position.' If Bork's
position is supported by others of skill in the art, one would expect that he would reference them rather
than himself to provide support for his statements. Given that the standard with regard to obtaining
U.S. patents is those of skill in the art, this observation casts doubt on the broad applicability of Bork's
position. It should also be noted that in Table 1, on page 399, in which selected examples of prediction
accuracy are presented, the reported accuracy of the methods which Appellants have employed
is, in fact, very high. While nowhere in Bork is there a comparison of the prediction accuracy based
on the percentage homology between two proteins or two classes of proteins, "Homology (several
methods)" is assigned an accuracy rate of 98% and "Functional features by homology" is assigned an
accuracy rate of 90%. Given that these figures were obtained based on what is at least a 4 year old
analysis, these high levels of accuracy would appear to support rather than refute Appellants' assertions
in the present case. Additionally Bork even states (on page 400, second column, line 17) that
"However, there is still no doubt that sequence analysis is extremely powerful". In summary, it is clear
that it is not Bork's intention to refute the value of sequence analysis but rather he is indicating that there
is room for improvement.

In summary, a careful reading of the cited "relevant literature" does not in fact support the
concept that function cannot be based on sequence and structural similarity; in contrast, many of the
examples actually support the use of such methodologies while identifying several areas in which caution
should be exercised. These inaccuracies and potential pitfalls can be overcome by a more careful
analysis by those of skill in the art. Automatic methods of sequence homology identification were only
the starting point for consideration; the sequences of the present invention underwent careful analysis by
a series of individuals of skill in the art, many highly qualified (experienced B.S. and 3 Ph.D. level
scientists).

Furthermore, this article is just an example of the few contrarian articles that the PTO has
repeatedly attempted to use to deny the utility of nucleic acid sequences based on a small number of 
publications that call into doubt prediction of protein function from homology information and the 
usefulness of bioinformatic predictions. While there may not be a 100% consensus within the scientific 
community regarding prediction of protein function from homology information, this is not unusual nor is 
it indicative of a general lack of consensus. A few rare exceptions do not a rule make. 

The position that bioinformatic information is recognized to be of value by those of skill in the 
art is supported by the results of a recent search of the NCBI-NLM-NIH public scientific database 
"PubMed" using the term "bioinformatics" which resulted in 5,548 different scientific publications (these 
will not be provided to avoid burdening the USPTO's scanning group). If bioinformatic information is
not useful in predicting protein function from structural homology information, why are so many
publications reporting the results of its use? Clearly this suggests that those of skill in the art do recognize
bioinformatic data as useful and valid. 

A second form of evidence supporting the position that bioinformatic information is recognized 
to be of value by those of skill in the art is the fact that many scientists, corporations and institutions 
elect to allocate significant proportions of their limited resources for access to private bioinformatic
systems and databases. Thus, it would appear obvious that those of skill in the art value and accept the 
results of bioinformatic analysis for they are willing to pay dearly for access to such information. 

A third, and perhaps most persuasive, form of evidence supporting the position that
bioinformatic information is recognized to be of value by those of skill in the art is the issuance of
multiple U.S. patents regarding bioinformatic prediction and methods for doing the same (see for
example, U.S. Patent Nos. 6,229,911, 6,567,540, 6,615,141, 6,631,331, 6,651,008, 6,677,114,
Exhibits E-J; copies of issued U.S. Patents not provided pursuant to current United States Patent and
Trademark Office policy). Of particular interest might be U.S. Patent No. 6,466,874 (Exhibit K;
copies of issued U.S. Patents not provided pursuant to current United States Patent and Trademark
Office policy), one of whose claims reads on "A method of identifying proteins as functionally linked,
the method comprising comparing sequences to find homologous functional domains." Why would a
U.S. Patent have issued on a method of carrying out an analysis that is without utility, because it is not
accepted by those of skill in the art as a credible method of predicting function from structural
homology information? This evidence convincingly indicates that even the USPTO recognizes the utility
of bioinformatic prediction.

Appellants respectfully point out that, as discussed above, the legal test for utility simply
involves an assessment of whether those skilled in the art would find any of the utilities described for the
invention to be believable. Appellants submit that the overwhelming majority of those of skill in the
relevant art would believe prediction of protein function from homology information and the usefulness
of bioinformatic predictions to be powerful and useful tools. Clearly the several forms of evidence
presented, and certainly the issuance of U.S. Patents, suggest that those of skill in the art recognize the
utility of bioinformatic analysis and its credibility in assessing structure function relationships. Thus the
vast majority of those of skill in the art would believe that Appellants' sequence encodes human
semaphorin proteins (specifically isoforms of sem2), a molecule of specific, substantial and
well-established utility, and thus the rejection of the presently claimed invention under 35 U.S.C. § 101 and
35 U.S.C. § 112 first paragraph should be overruled.

In addition to those utilities presented above, a still further example of utility of the present 
sequences is their use in diagnostic assays such as those associated with identification of paternity and 
forensic analysis, among others (see, for example, the specification at or about page 9, line 7; page 12, 
line 5 and page 18, line 11). The sequences of the present invention have particular utility as the 
application as filed contained an identified polymorphism (at or about page 17, lines 10-13). This results 
in a translationally silent A-to-G transition at, for example, the position corresponding to nucleotide 
2106 of SEQ ID NO:1. 

Naturally occurring genetic polymorphisms such as those described in the present specification 
are both the basis of, and critical to, inter alia, forensic genetic analysis and genetic analysis intended to 
resolve issues of identity and paternity. Appellants therefore find the Examiner's position on this utility 
difficult to comprehend, given that the results of identity and paternity analysis often have great emotional 
and substantial economic impact. This does not sound like a throwaway utility; rather, it sounds like a very 
substantial and real world utility. What could be more substantial and real world than the loss of an 
individual's freedom through incarceration and in some cases even the loss of life through execution? Yet 
forensic analysis based on identified polymorphisms is often used to convict or acquit. Both 
paternity and forensic genetic analysis are based on the use of identified polymorphisms. This is well 
known and generally accepted by those of skill in the art, who would readily recognize the utility and 
value of any identified polymorphism. Without identified polymorphisms, one would not be able to 
carry out such forensic or paternity analyses. The present application has identified just such essential 
polymorphisms within the sequences of the present invention, which identify human semaphorin proteins 
(specifically isoforms of sem2), molecules of well-established utility. 

Such polymorphisms are the basis for forensic analysis, paternity identification and population 
biology studies, which are undoubtedly "real world" utilities, and thus the present sequences must in 
themselves be useful. In and of themselves, each of these polymorphisms, including the silent ones, has 
significant and specific utility; the specificity of this utility is only amplified by the presence of so many 
polymorphisms that can arise in various combinations. It is also important to note that the presence of 
more useful polymorphic markers for such analysis would not mean that the present sequences lack 
utility. 

Appellants respectfully point out that those of skill in the art would readily recognize that the 
presently described polymorphisms, exactly as they were described in the specification as originally 
filed, are useful in forensic analysis, population biology and paternity analysis to specifically identify 
individual members of the human population based on the presence or absence of the described 
polymorphism. Simply because the use of these polymorphic markers will necessarily provide 
additional information on the percentage of particular subpopulations that contain one or more of these 
polymorphic markers does not mean that "additional research" is needed in order for these markers, as 
they are presently described in the instant specification, to be of use to forensic science. Without further 
experimentation, those of skill in the art would recognize the utility of the identified polymorphisms and 
how the asserted markers can distinguish 50% of the population in the worst case scenario. Thus the 
presence or the absence of a particular specific polymorphism is sufficient for use in the proposed 
utilities. Appellants provide the following detailed explanation. Those of skill in the art would recognize 
that in the worst case, least useful situation, a marker would be present in half of a population and 
absent from the other half. Therefore the probability of an individual having such a marker would be 1 
in 2, or 50%. Using the forensic analysis scenario for example, the analysis will have removed 50% of 
the possible suspects from the list, as either the suspect has the identified polymorphism or not. 
However, if a polymorphism were present in only, say, 10% of the population, the probability of an 
individual having such a polymorphic marker would be 1 in 10 (10%), and 90% of suspects could be 
eliminated from investigation or prosecution based on the presence or absence of the polymorphism. 
Clearly eliminating 90% of the suspects is better than eliminating 50% of the suspects. That said, 
eliminating 50%, or half, of the suspects on a list is without question very useful to any investigator. To 
reiterate, using the polymorphic markers as described in the specification as originally filed will definitely 
distinguish members of a population from one another. In the worst case scenario, each of these 
markers is useful to distinguish 50% of the population (in other words, the marker being present in half 
of the population). The ability to eliminate 50% of the population from a forensic analysis clearly is a 
real world, practical utility. Therefore, any allegation that the use of the presently described 
polymorphic markers is only potentially useful would be completely without merit, and would not 
support the alleged lack of utility. 
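
By way of illustration only, the arithmetic in the preceding paragraph can be sketched in a few lines of Python. The function names are hypothetical and do not appear in the specification. The sketch assumes that the evidentiary sample carries the marker, and the combined calculation further assumes that the markers occur independently of one another, an assumption that the specification does not itself state.

```python
from math import prod


def fraction_excluded(marker_frequency: float) -> float:
    """Fraction of a population excluded by a single marker.

    Assumption (not from the specification): the evidentiary sample
    carries the marker, so everyone lacking it is excluded.
    """
    if not 0.0 <= marker_frequency <= 1.0:
        raise ValueError("marker_frequency must be between 0 and 1")
    return 1.0 - marker_frequency


def fraction_excluded_combined(marker_frequencies: list[float]) -> float:
    """Fraction excluded when the sample carries every listed marker.

    Hypothetical simplification: the markers are treated as independent,
    so the fraction carrying all of them is the product of the frequencies.
    """
    for frequency in marker_frequencies:
        if not 0.0 <= frequency <= 1.0:
            raise ValueError("each frequency must be between 0 and 1")
    remaining = prod(marker_frequencies)  # fraction carrying all markers
    return 1.0 - remaining


if __name__ == "__main__":
    print(fraction_excluded(0.5))                  # worst-case marker
    print(fraction_excluded(0.1))                  # marker in 10% of people
    print(fraction_excluded_combined([0.5, 0.5]))  # two worst-case markers
```

Under these assumptions, a marker present in half of the population excludes half of the candidates, a marker present in 10% of the population excludes 90%, and two independent worst-case markers together exclude three quarters.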

The Examiner's assumption appears to be that since any human nucleic acid sequence that 
contains a naturally occurring polymorphism can be used in forensic analysis, in human paternity 
determinations or human population migration determinations, such utilities are generic and therefore 
lack substantial and specific utility. First, Appellants submit that until a specific polymorphic marker is 
actually described it has very limited utility in forensic analysis. Put another way, even if there 
is a possibility, or even a significant likelihood, that a particular nucleic acid sequence will contain a 
polymorphism and thus be useful in forensic analysis, such a likelihood is meaningless until a specific 
polymorphism is actually identified and described. The present case contains identified 
polymorphisms that occur in human semaphorin proteins (specifically isoforms of sem2). The Examiner 
is perhaps attempting to use the information presented for the first time by Appellants in the instant 
specification as hindsight verification that the presently claimed sequence would be expected to have 
polymorphic markers. Such a hindsight analysis based on Appellants' discovery would not be proper. 

Alternatively, the assumption that since any sequence containing a naturally occurring 
polymorphism can be used such utilities are generic and therefore lack substantial and specific utility 
may represent a confusion between the requirement for a specific utility, which is the proper standard 
for utility under 35 U.S.C. § 101, and a requirement for a unique utility. The relevant case law cited by 
Appellants makes it abundantly clear that the presence of other or even more useful polymorphic 
markers for forensic analysis does not mean that the present sequences lack a specific utility. As 
clearly stated by the Federal Circuit in Carl Zeiss Stiftung v. Renishaw PLC, 20 USPQ2d 1101 
(Fed. Cir. 1991; "Carl Zeiss"): 

An invention need not be the best or only way to accomplish a certain result, and it need 
only be useful to some extent and in certain applications: "[T]he fact that an invention has 
only limited utility and is only operable in certain applications is not grounds for finding a 
lack of utility." Envirotech Corp. v. Al George, Inc., 221 USPQ 473, 480 (Fed. Cir. 
1984) 

Importantly, the holding in the Carl Zeiss case is mandatory legal authority that essentially controls the 
outcome of the present appeal. This case, and particularly the cited quote, directly rebuts any such 
argument. Furthermore, the requirement for a unique utility is clearly not the standard adopted by the 
Patent and Trademark Office. If every invention were required to have a unique utility, the Patent and 
Trademark Office would no longer be issuing patents on batteries, automobile tires, golf balls, golf 
clubs, and treatments for a variety of human diseases, such as cancer and bacterial or viral infections, 
just to name a few particular examples, because examples of each of these have already been 
described and patented. All batteries have the exact same utility - specifically, to provide power. All 
automobile tires have the exact same utility - specifically, for use on automobiles. All golf balls and golf 
clubs have the exact same utility - specifically, use in the game of golf. All cancer treatments have the 
exact same utility - specifically, to treat cancer. All anti-infectious agents have the exact same broader 
utility - specifically, to treat infections. However, only the briefest perusal of virtually any issue of the 
Official Gazette provides numerous examples of patents being granted on each of the above 
compositions every week. Furthermore, if a composition needed to be unique to be patented, the 
entire class and subclass system would be an exercise in futility, as the class and subclass system serves 
solely to group such common inventions, which would not be required if each invention needed to have 
a unique utility. Thus, the present sequence clearly meets the requirements of 35 U.S.C. § 101. 

In addition, the First and Final Actions discount Appellants' assertion regarding the use of the 
presently claimed polynucleotides on DNA gene chips, based on the position that such a use would 
allegedly be generic. Further, these Actions seem to require Appellants to identify the biological role of 
the nucleic acid or function of the protein encoded by the presently claimed polynucleotides before the 
present sequences can be used in gene chip applications that meet the requirements of § 101. 
Appellants respectfully point out that knowledge of the exact function or role of the presently claimed 
sequence is not required to track expression patterns using a DNA chip. As set forth in Appellants' 
First Response, given the widespread utility of such "gene chip" methods using public domain gene 
sequence information, there can be little doubt that the use of the presently described novel sequences 
would have great utility in such DNA chip applications. Even though not a requirement for use of a 
sequence on a DNA chip, clearly, the claimed sequences, which encode human semaphorin proteins 
(specifically isoforms of sem2), molecules of recognized function that are believed to play a role in 
human disease, provide a specific marker of the gene encoding this protein and provide a unique 
identifier of the corresponding gene in the human genome. Such specific markers are targets for 
discovering drugs that are associated with human semaphorins, which are known to regulate the organization 
and fasciculation of nerves in the body and to be involved in human neural processes, stroke, cancer, and 
development, among others. Thus, those skilled in the art would instantly recognize that the present 
nucleotide sequence would be an ideal, novel candidate for assessing gene expression using, for 
example, DNA chips, as the specification details at least on or about page 6, line 1 through page 8. 
Such "DNA chips" clearly have utility, as evidenced by hundreds of issued U.S. Patents, exemplified by 
U.S. Patent Nos. 5,445,934, 5,556,752, 5,744,305, 5,837,832, 6,156,501 and 6,261,776 
(Exhibits L-Q; copies of issued U.S. Patents not provided pursuant to current United States Patent 
and Trademark Office policy). 

The Board is further requested to consider that, given the huge expense of the drug discovery 
process, even negative information has great "real world" practical utility. Knowing that a given gene is 
not expressed in medically relevant tissue provides an informative finding of great value to industry by 
allowing for the more efficient deployment of expensive drug discovery resources. Such practical 
considerations are equally applicable to the scientific community in general, in that time and resources 
are not wasted chasing what are essentially scientific dead-ends (from the perspective of medical 
relevance). Clearly, compositions that enhance the utility of such DNA gene chips, such as the 
presently claimed sequences encoding human semaphorins (variants of sem2) associated with human disease, 
must in themselves be useful. Moreover, the presently described sequences, which encode human 
semaphorins (variants of sem2), provide uniquely specific sequence resources for identifying and 
quantifying full length transcripts that were encoded by the corresponding human genomic locus. 
Accordingly, there can be no question that the described sequences provide an exquisitely specific 
utility for analyzing gene expression. Apparently the Examiner sees no public benefit in drugs or 
diagnostic assays directed at human disease, such as those involving neural processes like stroke, 
cancer, and developmental abnormalities. 

Additionally, only a small percentage of the genome (2-4%) actually encodes exons, which in 
turn encode amino acid sequences. Thus, not all human genomic DNA sequences are useful in such 
gene chip applications. This further discounts the Examiner's position that such uses are "generic". 
The present claims clearly meet the requirements of 35 U.S.C. § 101. It has been clearly established 
that a statement of utility in a specification must be accepted absent reasons why one skilled in the art 
would have reason to doubt the objective truth of such statement. In re Langer, 503 F.2d 1380, 
1391, 183 USPQ 288, 297 (CCPA, 1974); In re Marzocchi, 439 F.2d 220, 224, 169 USPQ 367, 
370 (CCPA, 1971). 

Evidence of the "real world" substantial utility of the present invention is further provided by the 
fact that there is an entire industry based on the use of gene sequences or fragments thereof in a gene 
chip format. Perhaps the most notable gene chip company is Affymetrix. However, there are many 
companies which have, at one time or another, concentrated on the use of gene sequences or 
fragments, in gene chip and non-gene chip formats, for example: Gene Logic, ABI-Perkin-Elmer, 
HySeq and Incyte. In addition, one such company, Rosetta Inpharmatics, was viewed to have such 
"real world" value that it was acquired by a large pharmaceutical company, Merck & Co., for a substantial 
sum of money (net equity value of the transaction was $620 million). The "real world" substantial 
industrial utility of gene sequences or fragments would, therefore, appear to be widespread and well 
established. Clearly, persons of skill in the art, as well as venture capitalists and investors, readily 
recognize the utility, both scientific and commercial, of genomic data in general, and specifically human 
genomic data. Billions of dollars have been invested in the human genome project, resulting in useful 
genomic data (see, e.g., Venter et al., 2001, Science 291:1304; Exhibit R). The results have been a 
stunning success, as the utility of human genomic data has been widely recognized as a great gift to 
humanity (see, e.g., Jasny and Kennedy, 2001, Science 291:1153; Exhibit S). Clearly, the usefulness 
of human genomic data, such as the presently claimed nucleic acid molecules, is substantial and credible 
(worthy of billions of dollars and the creation of numerous companies focused on such information) and 
well-established (the utility of human genomic information has been clearly understood for many years). 

Further evidence of utility of the presently claimed polynucleotide, although only one is needed 
to meet the requirements of 35 U.S.C. § 101 (Raytheon v. Roper, 220 USPQ 592 (Fed. Cir. 1983); 
In re Gottlieb, 140 USPQ 665 (CCPA 1964); In re Malachowski, 189 USPQ 432 (CCPA 1976); 
Hoffman v. Klaus, 9 USPQ2d 1657 (Bd. Pat. App. & Inter. 1988)), is the specific utility the present 
nucleotide sequence has in determining the genomic structure of the corresponding human chromosome 
(specification at or about page 12, line 4), for example mapping the protein encoding regions as 
described in the specification (specification at or about page 3, line 11 and page 12, lines 5-10) and as 
evidenced in the response to the Final Action and reiterated below. Clearly, the present polynucleotide 
provides exquisite specificity in localizing the specific region of the human chromosome containing the 
gene encoding the given polynucleotide, a utility not shared by virtually any other nucleic acid sequence. 
In fact, it is this specificity that makes this particular sequence so useful. Early gene mapping techniques 
relied on methods such as Giemsa staining to identify regions of chromosomes. However, such 
techniques produced genetic maps with a resolution of only 5 to 10 megabases, far too low to be of 
much help in identifying specific genes involved in disease. The skilled artisan readily appreciates the 
significant benefit afforded by markers that map a specific locus of the human genome, such as the 
present nucleic acid sequence. 

Only a minor percentage of the genome actually encodes exons, which in turn encode amino 
acid sequences. The presently claimed polynucleotide sequence provides biologically validated 
empirical data (e.g., showing which sequences are transcribed, spliced, and polyadenylated) that 
specifically defines that portion of the corresponding genomic locus that actually encodes exon 
sequence. Equally significant is that the claimed polynucleotide sequence defines how the encoded 
exons are actually spliced together to produce an active transcript (i.e., the described sequences are 
useful for functionally defining exon splice-junctions). The Appellants respectfully submit that the 
practical scientific value of expressed, spliced, and polyadenylated mRNA sequences is readily 
apparent to those skilled in the relevant biological and biochemical arts. For further evidence 
supporting the Appellants' position, the Board is requested to review, for example, section 3 of Venter 
et al. (supra at pp. 1317-1321, including Fig. 11 at pp. 1324-1325), which demonstrates the 
significance of expressed sequence information in the structural analysis of genomic data. The presently 
claimed polynucleotide sequence defines a biologically validated sequence that provides a unique and 
specific resource for mapping the genome essentially as described in the Venter et al. article. 

While it is clear that the present nucleotide sequences have specific utility in determining 
genomic structure, mapping of the corresponding human chromosome, and determining protein 
encoding regions, as was described in the specification, discussed in Appellants' response to the First 
Action, and both discussed and evidenced in Appellants' response to the Final Action, the evidence is 
reiterated here. Evidence supporting Appellants' assertions of the specific utility of the sequences of the 
present invention in localizing the specific region of the human chromosome and in identifying functionally 
active intron/exon splice junctions is provided as Exhibit T. This is the result of 
overlaying the sequence of SEQ ID NO:1 of the present invention on the identified human genomic 
sequence. By doing this, one is able to identify the portions of the genome that encode the present 
invention. As these regions of the genome are non-contiguous, this is indicative of individual exons. 
The results of such an analysis indicate that the sequence of the present invention is the result of a 16 
exon gene contained within the BAC clone AC006208.3. Clearly, as the gene of the present invention 
is encoded by 16 non-contiguous exons on chromosome 3, one would not have been able to deduce 
the sequence that encodes the molecules of the present invention without knowing the specific 
sequence. Clearly, the present polynucleotide provides exquisite specificity in localizing the specific 
region of human chromosome 3 that contains the gene encoding the given polynucleotide, a utility not 
shared by virtually any other nucleic acid sequence. The sequences of the present invention provide 
that necessary specific prior knowledge. In fact, it is this specificity that makes this particular sequence 
so useful. 

Additionally, it should be noted that the gene encoding BAA98132, Exhibit A, identified as 
Homo sapiens semaphorin protein sem2, also maps within the same region of human chromosome 3 
(essentially position 3p3 1 .3 1). Thus in addition to providing direct evidence of the utility of the 
sequences of the present invention in chromosome mapping, this evidence further supports Appellants' 
assertion that the sequences of the present invention encode isoforms of Homo sapiens semaphorin 
protein sem2. 

The Examiner's repeated position regarding this utility, like the use of these specific sequences on 
DNA chips or the described polymorphisms in forensic analysis, is that since other molecules can be 
used to map the human chromosome or on DNA chips or in forensic analysis, these utilities are not 
specific or substantial. As described previously above, Appellants once again point out that these 
arguments are completely rebutted by the Federal Circuit's holding in Carl Zeiss, supra ("[A]n 
invention need not be the best or only way to accomplish a certain result"). Furthermore, the argument 
that a utility is rendered generic, and therefore invalid, merely because other objects share that utility 
raises the question previously presented: do not all golf balls and tires have the same utility as other 
golf balls or tires, i.e., use as golf balls or tires respectively? Yet these items are readily considered 
to have patentable utility. 

It has been clearly established that a statement of utility in a specification must be accepted absent 
reasons why one skilled in the art would have reason to doubt the objective truth of such statement. In re 
Langer, 503 F.2d 1380, 1391, 183 USPQ 288, 297 (CCPA, 1974; "Langer"); In re Marzocchi, 439 
F.2d 220, 224, 169 USPQ 367, 370 (CCPA, 1971). As clearly set forth in Langer: 

As a matter of Patent Office practice, a specification which contains a disclosure of 
utility which corresponds in scope to the subject matter sought to be patented must be 
taken as sufficient to satisfy the utility requirement of § 101 for the entire claimed 
subject matter unless there is a reason for one skilled in the art to question the objective 
truth of the statement of utility or its scope. 

Langer at 297, emphasis in original. As set forth in the MPEP, "Office personnel must provide evidence 
sufficient to show that the statement of asserted utility would be considered 'false' by a person of ordinary 
skill in the art" (MPEP, Eighth Edition at 2100-40, emphasis added). 

In the present case, Appellants have provided multiple forms of evidence supporting their assertion 
that the sequences of the present invention encode human semaphorin proteins (isoforms of sem2), 
molecules with specific, substantial and well-established utility. In contrast, the Examiner has failed to 
provide evidence that the asserted utilities would be considered 'false' by a person of ordinary skill in the 
art and therefore has failed to provide support for the pending utility rejections, as required by the Utility 
Guidelines and the law. Thus clearly the rejection of the presently claimed invention under 35 U.S.C. § 
101 and 35 U.S.C. § 112, first paragraph, was improper and should be overruled. 

Finally, Appellants fully recognize that all patent applications are examined on their own 
merits and that the prosecution of one patent does not affect the prosecution of another patent (In re 
Wertheim, 541 F.2d 257, 264, 191 USPQ 90, 97 (CCPA 1976)). However, the issue at hand is 
whether the fact that patents have issued recognizing the utility of a class of molecules confers 
a statutory precedent of patentability on a broad class of compositions. Thus, there remains a lingering 
issue regarding due process and equitable treatment under the law. While Appellants are well aware of 
the new Utility Guidelines set forth by the USPTO, Appellants respectfully point out that the current 
rules and regulations regarding the examination of patent applications are and always have been the patent 
laws as set forth in 35 U.S.C. and the patent rules as set forth in 37 C.F.R., not the Manual of Patent 
Examining Procedure or particular guidelines for patent examination set forth by the USPTO. 
Furthermore, it is the job of the judiciary, not the USPTO, to interpret these laws and rules. Appellants 
are unaware of any significant recent changes in either 35 U.S.C. § 101, or in the interpretation of 
35 U.S.C. § 101 by the Supreme Court or the Federal Circuit, that is in keeping with the new Utility 
Guidelines set forth by the USPTO. This is underscored by numerous patents that have been issued 
over the years that claim nucleic acid fragments that do not comply with the new Utility Guidelines. As 
examples of such issued U.S. Patents, the Examiner is invited to review U.S. Patent Nos. 5,817,479, 
5,654,173, and 5,552,281 (each of which claims short polynucleotides; Exhibits U-W; 
copies of issued U.S. Patents not provided pursuant to current United States Patent and Trademark 
Office policy), and recently issued U.S. Patent No. 6,340,583 (which includes no working examples; 
Exhibit X; copies of issued U.S. Patents not provided pursuant to current United States Patent and 
Trademark Office policy), none of which contain examples of the "real-world" utilities that the Examiner 
appears to desire. Given the rapid pace of development in the biotechnology arts, it is difficult for the 
Appellants to understand how an invention fully disclosed and free of prior art at the time the present 
application was filed could somehow retain less utility and be less enabled than inventions in the cited 
issued U.S. patents (which were filed during a time when the level of skill in the art was clearly lower). 
Simply put, Appellants' invention is more enabled and retains at least as much utility as the inventions 
described in the claims of the U.S. patents of record. As issued U.S. Patents are presumed to meet all 
of the requirements for patentability, including 35 U.S.C. §§ 101 and 112, first paragraph, Appellants 
submit that the present polynucleotides must also meet the requirements of 35 U.S.C. § 101. While 
Appellants agree that each application is examined on its own merits, Appellants are unaware of any 
changes to 35 U.S.C. § 101, or in the interpretation of 35 U.S.C. § 101 by the Supreme Court or the 
Federal Circuit, since the issuance of these patents that render the subject matter claimed in these 
patents, which is similar to the subject matter in question in the present application, suddenly non- 
statutory or failing to meet the requirements of 35 U.S.C. § 101. Thus, holding Appellants' invention to 
a different standard of utility is inconsistent and inequitable; such a judgment, being arbitrary and 
capricious and a violation of due process and equal protection under the law, cannot be maintained. 

Thus, in summary, Appellants' application described novel nucleic and amino acid sequences 
that encode human semaphorin proteins (isoforms of sem2), molecules with specific, substantial and 
well-established utility. Semaphorins have, as stated in the specification, recognized associations with 
human development and disease. Furthermore, the application also described the tissue specific 
expression pattern and a naturally occurring polymorphism that occurs within the sequences of the 
present invention, which provide additional utility. The present situation directly tracks Example 10 of 
the Revised Interim Utility Guidelines Training Materials (pages 53-55), which establishes that a 
rejection under 35 U.S.C. § 101, as allegedly lacking a patentable utility, and under 35 U.S.C. § 112, 
first paragraph, as allegedly unusable by the skilled artisan due to the alleged lack of patentable utility, is 
not proper when the full length sequence of the invention encodes a protein that has a well known 
function. Therefore, Appellants submit that as the presently claimed sequences have been shown to 
have a substantial, specific, credible and well-established utility, the rejection of the claims under 
35 U.S.C. § 101 and 35 U.S.C. § 112, first paragraph, was improper. Thus, Appellants respectfully 
submit that the utility rejection of the pending claims under 35 U.S.C. § 101 and 35 U.S.C. § 112, first 
paragraph, must be overruled. 

B. Are Claims 1-4, 11 and 12 Unusable Due to a Lack of Patentable Utility? 

The Final Action next rejects claims 1-4, 11 and 12 under 35 U.S.C. § 112, first paragraph, 
since allegedly one skilled in the art would not know how to use the invention, as the invention allegedly 
is not supported by either a clear asserted utility or a well-established utility. 

The arguments detailed above in Section VIII(A) concerning the utility of the presently claimed 
sequences are incorporated herein by reference. As the Federal Circuit and its predecessor have 
determined that the utility requirement of Section 101 and the how to use requirement of Section 112, 
first paragraph, have the same basis, specifically the disclosure of a credible utility (In re Brana, supra; 
In re Jolles, 628 F.2d 1322, 1326 n.11, 206 USPQ 885, 889 n.11 (CCPA 1980); In re Fouche, 
439 F.2d 1237, 1243, 169 USPQ 429, 434 (CCPA 1971)), Appellants submit that as claims 1-4, 11 
and 12 have been shown to have "a specific, substantial, and credible utility", as detailed in Section 
VIII(A) above, the present rejection of claims 1-4, 11 and 12 under 35 U.S.C. § 112, first paragraph, 
cannot stand. 

Appellants therefore submit that the rejection of claims 1-4, 11 and 12 under 35 U.S.C. § 112, 
first paragraph, must be overruled. 

C. Is Claim 11 Indefinite? 

Appellants would first like to apologize to the Board for having to brief this issue, 
which would have been obviated had Appellants' previously submitted amendment been entered in 
this case. However, as this Brief is in its fourth month of extension without an Advisory Action, and 
Appellants have been unable to confirm that the previously submitted amendments have been entered 
in this case, Appellants believe it far more prudent to respond at this time rather than incur additional 
fees and potential loss of patent term. 

Claim 11 stands rejected under 35 U.S.C. § 112, second paragraph, as being allegedly 
indefinite for recitation of the phrase "comprising [a] nucleic acid 
sequence of Claim 4." The Examiner interprets this to connote some part of the 
nucleic acid sequence of Claim 4 and states that replacement of "a" with "the" will resolve the issue. 

Appellants submit that all that is required under 35 U.S.C. § 112, second paragraph, is that the 
skilled artisan be apprised both of the utilization and scope of the invention (Shatterproof Glass Corp. 
v. Libbey Owens Ford Co., 225 USPQ 634 (Fed. Cir. 1985)). The Federal Circuit has clearly stated 
in S3 v. Nvidia (259 F.3d 1364 (Fed. Cir. 2001)): 

The requirement that the claims "particularly point out and distinctly claim" the invention is 
met when a person experienced in the field of the invention would understand the scope 
of the subject matter that is patented when the claim is read in conjunction with the rest of 
the specification. "If the claims read in light of the specification reasonably apprise those 
skilled in the art of the scope of the invention, § 112 demands no more" (Miles Labs., Inc. 
v. Shandon, Inc., 27 USPQ2d 1123, 1126 (Fed. Cir. 1993); see also Union Pacific 
Resources Co. v. Chesapeake Energy Corp., 236 F.3d 684, 692, 57 USPQ2d 1293, 
1297 (Fed. Cir. 2001); North American Vaccine, Inc. v. American Cyanamid Co., 
F.3d 1571, 1579, 28 USPQ2d 1333, 1339 (Fed. Cir. 1993); Hybritech Inc. v. 
Monoclonal Antibodies, 802 F.2d 1367, 1385, 231 USPQ 81, 94-95 (Fed. Cir. 1986)). 

Appellants understand that traditional dependent claim construction would suggest that the 
use of the term "the" in Claim 11 would be technically correct. The present situation differs, however, 
in that, due to the degeneracy of the nucleic acid triplet codons, more than one nucleic acid sequence 
can encode a single amino acid sequence. This degeneracy has been known to those of skill in the art 
since the 1960s and is accepted by those of skill in the art. Thus, while finite, "a nucleotide sequence 
that encodes the amino acid sequence shown in SEQ ID NO: 4" does not represent a single nucleic acid 
sequence. Therefore, the use of the word "the" in this context would have been grammatically incorrect. 
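
By way of illustration only, the degeneracy point can be made concrete with a short sketch. The Python example below is not part of the record. It assumes the standard genetic code, and the helper name `count_encoding_sequences` and the six-residue peptide MAPSAW are hypothetical choices made purely for the example. It multiplies the number of synonymous codons at each position to count how many distinct nucleic acid sequences encode one amino acid sequence.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (first base varies slowest, third base fastest). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of synonymous codons available for each amino acid.
DEGENERACY = {}
for amino_acid in CODON_TABLE.values():
    DEGENERACY[amino_acid] = DEGENERACY.get(amino_acid, 0) + 1


def count_encoding_sequences(peptide: str) -> int:
    """Count the distinct nucleotide sequences that encode `peptide`.

    Hypothetical helper for illustration: each residue contributes one
    independent choice among its synonymous codons, so the total is the
    product of the per-residue codon counts.
    """
    total = 1
    for residue in peptide:
        total *= DEGENERACY[residue]
    return total


if __name__ == "__main__":
    example = "MAPSAW"  # arbitrary six-residue example, not a claimed sequence
    print(example, count_encoding_sequences(example))
```

For the six-residue example, the count is 1 x 4 x 4 x 6 x 4 x 1 = 384 distinct coding sequences: a finite number, but not a single sequence. 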

While in no way agreeing with the Examiner's rejection, in order to advance this application more 
quickly towards allowance, Appellants submitted an amendment to Claim 11 in the belief that, in all but the 
most tortured reading of the claim, one of skill in the art would readily understand that, whether 
"the" nucleic acid or "a" nucleic acid is used in the context of Claim 11, the phrase would be interpreted as 
meaning any of the nucleic acid sequences defined by Claim 4, that is to say, any of the nucleic acid 
sequences that encode the amino acid sequence shown in SEQ ID NO: 4. Thus, those of skill in the art 
would have recognized both the utilization and scope of the invention as set forth in Claim 11, and therefore 
Claim 11 meets the requirements of 35 U.S.C. § 112, second paragraph, and this rejection should be 
overturned. 

IX. APPENDIX 

The claims involved in this appeal are as follows: 



1. An isolated nucleic acid molecule comprising the nucleotide sequence of SEQ 
ID NO: 1. 

2. An isolated nucleic acid molecule comprising a nucleotide sequence that: 

(a) encodes the amino acid sequence shown in SEQ ID NO: 2; and 

(b) hybridizes under stringent conditions to the nucleotide sequence of SEQ 
ID NO: 1 or the complement thereof. 

3. An isolated nucleic acid molecule comprising a nucleotide sequence that 
encodes the amino acid sequence shown in SEQ ID NO: 2. 

4. An isolated nucleic acid molecule comprising a nucleotide sequence that 
encodes the amino acid sequence shown in SEQ ID NO: 4. 

11. An expression vector comprising a nucleic acid sequence of Claim 4. 

12. A cell comprising the expression vector of Claim 11. 

X. CONCLUSION 

Appellants respectfully submit that, in light of the foregoing arguments, the Final Action's 
conclusion that claims 1-4, 11 and 12 lack a patentable utility and are unusable by the skilled artisan 
due to a lack of patentable utility is unwarranted. It is therefore requested that the Board overturn the 
Final Action's rejections. 



Respectfully submitted, 

March 26, 2004 
Date        Lance K. Ishimoto, Reg. No. 41,866 

Compare Genomic Sequence 






FASTA searches a protein or DNA sequence data bank 
version 3.3t05 March 30, 2000 
Please cite: 

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448 
/tmp/fastaCAANHaihS: 874 aa 



vs /tmp/fastaDAAOHaihS library 
searching /tmp/fastaDAAOHaihS library 

782 residues in 1 sequences 

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15: -5)] ktup: 2 

join: 38, opt: 26, gap-pen: -12/ -2, width: 16 

Scan time: 0.017 
The best scores are: opt 
gi|8978202|dbj|BAA98132.1| semaphorin sem2 [Homo ( 782) 5450 

>>gi|8978202|dbj|BAA98132.1| semaphorin sem2 [Homo sapie (782 aa) 

initn: 5448 init1: 4266 opt: 5450 
Smith-Waterman score: 5450; 99.872% identity in 782 aa overlap (94-874:1-782) 



70 80 90 100 110 120 

SEQ GGSRANYNRRPAGPEGGSAGRRQRCPQFPSMAPSAWAICWLLGGLLLHGGSSGPSPGPSV 

gi I 897 MAPSAWAICWLLGGLLLHGGSSGPSPGPSV 

10 20 30 

130 140 150 160 170 180 

SEQ PRLRLSYRDLLSANRSAIFLGPQGSLNLQAMYLDEYRDRLFLGGLDALYSLRLDQAWPDP 

gi I 897 PRLRLSYRDLLSANRSAIFLGPQGSLNLQAMYLDEYRDRLFLGGLDALYSLRLDQAWPDP 
40 50 60 70 80 90 

190 200 210 220 230 240 

SEQ REVLWPPQPGQREECVRKGRDPLTECANFVRVLQPHNRTHLLACGTGAFQPTCALITVGH 

gi I 897 REVLWPPQPGQREECVRKGRDPLTECANFVRVLQPHNRTHLLACGTGAFQPTCALITVGH 
100 110 120 130 140 150 

250 260 270 280 290 300 

SEQ RGEHVLHLEPGSVESGRGRCPHEPSRPFASTFIDGELYTGLTADFLGREAMIFRSGGPRP 

gi I 897 RGEHVLHLEPGSVESGRGRCPHEPSRPFASTFIDGELYTGLTADFLGREAMIFRSGGPRP 
160 170 180 190 200 210 

310 320 330 340 350 360 

SEQ ALRSDSDQSLLHDPRFVMAARIPENSDQDNDKVYFFFSETVPSPDGGSNHVTVSRVGRVC 

gi|897 ALRSDSDQSLLHDPRFVMAARIPENSDQDNDKVYFFFSETVPSPDGGSNHVTVSRVGRVC 
220 230 240 250 260 270 

370 380 390 400 410 420 

SEQ VNDAGGQRVLVNKWSTFLKARLVCSVPGPGGAETHFDQLEDVFLLWPKAGKSLEVYALFS 

gi I 897 VNDAGGQRVLVNKWSTFLKARLVCSVPGPGGAETHFDQLEDVFLLWPKAGKSLEVYALFS 
280 290 300 310 320 330 

430 440 450 460 470 480 

SEQ TVSAVFQGFAVCVYHMADIWEVFNGPFAHRDGPQHQWGPYGGKVPFPRPGVCPSKMTAQP 

gi|897 TVSAVFQGFAVCVYHMADIWEVFNGPFAHRDGPQHQWGPYGGKVPFPRPGVCPSKMTAQP 
340 350 360 370 380 390 

490 500 510 520 530 540 

SEQ GRPFGSTKDYPDEVLQFARAHPLMFWPVRPRHGRPVLVKTHLAQQLHQIVVDRVEAEDGT 

gi|897 GRPFGSTKDYPDEVLQFARAHPLMFWPVRPRHGRPVLVKTHLAQQLHQIVVDRVEAEDGT 
400 410 420 430 440 450 

550 560 570 580 590 600 

SEQ YDVIFLGTDSGSVLKVIALQAGGSAEPEEVVLEELQVFKVPTPITEMEISVKRQMLYVGS 

gi|897 YDVIFLGTDSGSVLKVIALQAGGSAEPEEVVLEELQVFKVPTPITEMEISVKRQMLYVGS 
460 470 480 490 500 510 

610 620 630 640 650 660 

SEQ RLGVAQLRLHQCETYGTACAECCLARDPYCAWDGASCTHYRPSLGKRRFRRQDIRHGNPA 

gi I 897 RLGVAQLRLHQCETYGTACAECCLARDPYCAWDGASCTHYRPSLGKRRFRRQDIRHGNPA 
520 530 540 550 560 570 

670 680 690 700 710 720 

SEQ LQCLGQSQEEEAVGLVAATMVYGTEHNSTFLECLPKSP-AAVRWLLQRPGDEGPDQVKTD 

gi I 897 LQCLGQSQEEEAVGLVAATMVYGTEHNSTFLECLPKSPQAAVRWLLQRPGDEGPDQVKTD 
580 590 600 610 620 630 

730 740 750 760 . 770 780 

SEQ ERVLHTERGLLFRRLSRFDAGTYTCTTLEHGFSQTVVRLALVVIVASQLDNLFPPEPKPE 

gi|897 ERVLHTERGLLFRRLSRFDAGTYTCTTLEHGFSQTVVRLALVVIVASQLDNLFPPEPKPE 
640 650 660 670 680 690 

790 800 810 820 830 840 

SEQ EPPARGGLASTPPKAWYKDILQLIGFANLPRVDEYCERVWCRGTTECSGCFRSRSRGKQA 

gi I 897 EPPARGGLASTPPKAWYKDILQLIGFANLPRVDEYCERVWCRGTTECSGCFRSRSRGKQA 
700 710 720 730 740 750 

850 860 870 

SEQ RGKSWAGLELGKKMKSRVHAEHNRTPREVEAT 



gi|897 RGKSWAGLELGKKMKSRVHAEHNRTPREVEAT 
760 770 780 



874 residues in 1 query sequences 
782 residues in 1 library sequences 
Scomplib [version 3.3t05 March 30, 2000] 

start: Mon Feb 10 16:12:17 2003 done: Mon Feb 10 16:12:17 2003 
Scan time: 0.017 Display time: 0.933 

Function used was FASTA 

FASTA searches a protein or DNA sequence data bank 
version 3.3t05 March 30, 2000 
Please cite: 

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448 

/tmp/fastaCAAWTaiDv: 2349 nt 

>LEX151 SEQ ID NO:3 

vs /tmp/fastaDAAXTaiDv library 
searching /tmp/fastaDAAXTaiDv library 

4700 residues in 1 sequences 

FASTA (3.34 January 2000) function [optimized, +5/-4 matrix (5:-4)] ktup: 6 

join: 74, opt: 59, gap-pen: -16/ -4, width: 16 

Scan time: 0.100 
The best scores are: opt 
gi|8978201|dbj|AB029496.1| Homo sapiens mRNA f (4700) [f] 11742 
gi|8978201|dbj|AB029496.1| Homo sapiens mRNA f (4700) [r] 78 

>>gi|8978201|dbj|AB029496.1| Homo sapiens mRNA for semap (4700 nt) 
initn: 11742 init1: 11742 opt: 11742 
99.957% identity in 2349 nt overlap (1-2349:1-2349) 

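               10        20        30        40        50        60
LEX151 ATGGCCCCCTCGGCCTGGGCCATTTGCTGGCTGCTAGGGGGCCTCCTGCTCCATGGGGGT
gi|897 ATGGCCCCCTCGGCCTGGGCCATTTGCTGGCTGCTAGGGGGCCTCCTGCTCCATGGGGGT
               10        20        30        40        50        60

               70        80        90       100       110       120
LEX151 AGCTCTGGCCCCAGCCCCGGCCCCAGTGTGCCCCGCCTGCGGCTCTCCTACCGAGACCTC
gi|897 AGCTCTGGCCCCAGCCCCGGCCCCAGTGTGCCCCGCCTGCGGCTCTCCTACCGAGACCTC
               70        80        90       100       110       120

              130       140       150       160       170       180
LEX151 CTGTCTGCCAACCGCTCTGCCATCTTTCTGGGCCCCCAGGGCTCCCTGAACCTCCAGGCC
gi|897 CTGTCTGCCAACCGCTCTGCCATCTTTCTGGGCCCCCAGGGCTCCCTGAACCTCCAGGCC
              130       140       150       160       170       180

              190       200       210       220       230       240
LEX151 ATGTACCTAGATGAGTACCGAGACCGCCTCTTTCTGGGTGGCCTGGACGCCCTCTACTCT
gi|897 ATGTACCTAGATGAGTACCGAGACCGCCTCTTTCTGGGTGGCCTGGACGCCCTCTACTCT
              190       200       210       220       230       240

              250       260       270       280       290       300
LEX151 CTGCGGCTGGACCAGGCATGGCCAGATCCCCGGGAGGTCCTGTGGCCACCGCAGCCAGGA
gi|897 CTGCGGCTGGACCAGGCATGGCCAGATCCCCGGGAGGTCCTGTGGCCACCGCAGCCAGGA
              250       260       270       280       290       300

              310       320       330       340       350       360
LEX151 CAGAGGGAGGAGTGTGTTCGAAAGGGAAGAGATCCTTTGACAGAGTGCGCCAACTTCGTG
gi|897 CAGAGGGAGGAGTGTGTTCGAAAGGGAAGAGATCCTTTGACAGAGTGCGCCAACTTCGTG
              310       320       330       340       350       360

              370       380       390       400       410       420
LEX151 CGGGTGCTACAGCCTCACAACCGGACCCACCTGCTAGCCTGTGGCACTGGGGCCTTCCAG
gi|897 CGGGTGCTACAGCCTCACAACCGGACCCACCTGCTAGCCTGTGGCACTGGGGCCTTCCAG
              370       380       390       400       410       420

              430       440       450       460       470       480
LEX151 CCCACCTGTGCCCTCATCACAGTTGGCCACCGTGGGGAGCATGTGCTCCACCTGGAGCCT
gi|897 CCCACCTGTGCCCTCATCACAGTTGGCCACCGTGGGGAGCATGTGCTCCACCTGGAGCCT
              430       440       450       460       470       480

              490       500       510       520       530       540
LEX151 GGCAGTGTGGAAAGTGGCCGGGGGCGGTGCCCTCACGAGCCCAGCCGTCCCTTTGCCAGC
gi|897 GGCAGTGTGGAAAGTGGCCGGGGGCGGTGCCCTCACGAGCCCAGCCGTCCCTTTGCCAGC
              490       500       510       520       530       540

              550       560       570       580       590       600
LEX151 ACCTTCATAGACGGGGAGCTGTACACGGGTCTCACTGCTGACTTCCTGGGGCGAGAGGCC
gi|897 ACCTTCATAGACGGGGAGCTGTACACGGGTCTCACTGCTGACTTCCTGGGGCGAGAGGCC
              550       560       570       580       590       600

              610       620       630       640       650       660
LEX151 ATGATCTTCCGAAGTGGAGGTCCTCGGCCAGCTCTGCGTTCCGACTCTGACCAGAGTCTC
gi|897 ATGATCTTCCGAAGTGGAGGTCCTCGGCCAGCTCTGCGTTCCGACTCTGACCAGAGTCTC
              610       620       630       640       650       660

              670       680       690       700       710       720
LEX151 TTGCACGACCCCCGGTTTGTGATGGCCGCCCGGATCCCTGAGAACTCTGACCAGGACAAT
gi|897 TTGCACGACCCCCGGTTTGTGATGGCCGCCCGGATCCCTGAGAACTCTGACCAGGACAAT
              670       680       690       700       710       720

              730       740       750       760       770       780
LEX151 GACAAGGTGTACTTCTTCTTCTCGGAGACGGTCCCCTCGCCCGATGGTGGCTCGAACCAT
gi|897 GACAAGGTGTACTTCTTCTTCTCGGAGACGGTCCCCTCGCCCGATGGTGGCTCGAACCAT
              730       740       750       760       770       780

              790       800       810       820       830       840
LEX151 GTCACTGTCAGCCGCGTGGGCCGCGTCTGCGTGAATGATGCTGGGGGCCAGCGGGTGCTG
gi|897 GTCACTGTCAGCCGCGTGGGCCGCGTCTGCGTGAATGATGCTGGGGGCCAGCGGGTGCTG
              790       800       810       820       830       840

              850       860       870       880       890       900
LEX151 GTGAACAAATGGAGCACTTTCCTCAAGGCCAGGCTGGTCTGCTCGGTGCCCGGCCCTGGT
gi|897 GTGAACAAATGGAGCACTTTCCTCAAGGCCAGGCTGGTCTGCTCGGTGCCCGGCCCTGGT
              850       860       870       880       890       900

              910       920       930       940       950       960
LEX151 GGTGCCGAGACCCACTTTGACCAGCTAGAGGATGTGTTCCTGCTGTGGCCCAAGGCCGGG
gi|897 GGTGCCGAGACCCACTTTGACCAGCTAGAGGATGTGTTCCTGCTGTGGCCCAAGGCCGGG
              910       920       930       940       950       960

              970       980       990      1000      1010      1020
LEX151 AAGAGCCTCGAGGTGTACGCGCTGTTCAGCACCGTCAGTGCCGTGTTCCAGGGCTTCGCC
gi|897 AAGAGCCTCGAGGTGTACGCGCTGTTCAGCACCGTCAGTGCCGTGTTCCAGGGCTTCGCC
              970       980       990      1000      1010      1020

             1030      1040      1050      1060      1070      1080
LEX151 GTCTGTGTGTACCACATGGCAGACATCTGGGAGGTTTTCAACGGGCCCTTTGCCCACCGA
gi|897 GTCTGTGTGTACCACATGGCAGACATCTGGGAGGTTTTCAACGGGCCCTTTGCCCACCGA
             1030      1040      1050      1060      1070      1080

             1090      1100      1110      1120      1130      1140
LEX151 GATGGGCCTCAGCACCAGTGGGGGCCCTATGGGGGCAAGGTGCCCTTCCCTCGCCCTGGC
gi|897 GATGGGCCTCAGCACCAGTGGGGGCCCTATGGGGGCAAGGTGCCCTTCCCTCGCCCTGGC
             1090      1100      1110      1120      1130      1140

             1150      1160      1170      1180      1190      1200
LEX151 GTGTGCCCCAGCAAGATGACCGCACAGCCAGGACGGCCTTTTGGCAGCACCAAGGACTAC
gi|897 GTGTGCCCCAGCAAGATGACCGCACAGCCAGGACGGCCTTTTGGCAGCACCAAGGACTAC
             1150      1160      1170      1180      1190      1200

             1210      1220      1230      1240      1250      1260
LEX151 CCAGATGAGGTGCTGCAGTTTGCCCGAGCCCACCCCCTCATGTTCTGGCCTGTGCGGCCT
gi|897 CCAGATGAGGTGCTGCAGTTTGCCCGAGCCCACCCCCTCATGTTCTGGCCTGTGCGGCCT
             1210      1220      1230      1240      1250      1260

             1270      1280      1290      1300      1310      1320
LEX151 CGACATGGCCGCCCTGTCCTTGTCAAGACCCACCTGGCCCAGCAGCTACACCAGATCGTG
gi|897 CGACATGGCCGCCCTGTCCTTGTCAAGACCCACCTGGCCCAGCAGCTACACCAGATCGTG
             1270      1280      1290      1300      1310      1320

             1330      1340      1350      1360      1370      1380
LEX151 GTGGACCGCGTGGAGGCAGAGGATGGGACCTACGATGTCATTTTCCTGGGGACTGACTCA
gi|897 GTGGACCGCGTGGAGGCAGAGGATGGGACCTACGATGTCATTTTCCTGGGGACTGACTCA
             1330      1340      1350      1360      1370      1380

             1390      1400      1410      1420      1430      1440
LEX151 GGGTCTGTGCTCAAAGTCATCGCTCTCCAGGCAGGGGGCTCAGCTGAACCTGAGGAAGTG
gi|897 GGGTCTGTGCTCAAAGTCATCGCTCTCCAGGCAGGGGGCTCAGCTGAACCTGAGGAAGTG
             1390      1400      1410      1420      1430      1440

             1450      1460      1470      1480      1490      1500
LEX151 GTTCTGGAGGAGCTCCAGGTGTTTAAGGTGCCAACACCTATCACCGAAATGGAGATCTCT
gi|897 GTTCTGGAGGAGCTCCAGGTGTTTAAGGTGCCAACACCTATCACCGAAATGGAGATCTCT
             1450      1460      1470      1480      1490      1500

             1510      1520      1530      1540      1550      1560
LEX151 GTCAAAAGGCAAATGCTATACGTGGGCTCTCGGCTGGGTGTGGCCCAGCTGCGGCTGCAC
gi|897 GTCAAAAGGCAAATGCTATACGTGGGCTCTCGGCTGGGTGTGGCCCAGCTGCGGCTGCAC
             1510      1520      1530      1540      1550      1560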

             1570      1580      1590      1600      1610      1620
LEX151 CAATGTGAGACTTACGGCACTGCCTGTGCAGAGTGCTGCCTGGCCCGGGACCCATACTGT
gi|897 CAATGTGAGACTTACGGCACTGCCTGTGCAGAGTGCTGCCTGGCCCGGGACCCATACTGT
             1570      1580      1590      1600      1610      1620

             1630      1640      1650      1660      1670      1680
LEX151 GCCTGGGATGGTGCCTCCTGTACCCACTACCGCCCCAGCCTTGGCAAGCGCCGGTTCCGC
gi|897 GCCTGGGATGGTGCCTCCTGTACCCACTACCGCCCCAGCCTTGGCAAGCGCCGGTTCCGC
             1630      1640      1650      1660      1670      1680

             1690      1700      1710      1720      1730      1740
LEX151 CGGCAGGACATCCGGCACGGCAACCCTGCCCTGCAGTGCCTGGGCCAGAGCCAGGAAGAA
gi|897 CGGCAGGACATCCGGCACGGCAACCCTGCCCTGCAGTGCCTGGGCCAGAGCCAGGAAGAA
             1690      1700      1710      1720      1730      1740

             1750      1760      1770      1780      1790      1800
LEX151 GAGGCAGTGGGACTTGTGGCAGCCACCATGGTCTACGGCACGGAGCACAATAGCACCTTC
gi|897 GAGGCAGTGGGACTTGTGGCAGCCACCATGGTCTACGGCACGGAGCACAATAGCACCTTC
             1750      1760      1770      1780      1790      1800

             1810      1820      1830      1840      1850      1860
LEX151 CTGGAGTGCCTGCCCAAGTCTCCCCARGCTGCTGTGCGCTGGCTCTTGCAGAGGCCAGGG
gi|897 CTGGAGTGCCTGCCCAAGTCTCCCCAGGCTGCTGTGCGCTGGCTCTTGCAGAGGCCAGGG
             1810      1820      1830      1840      1850      1860

             1870      1880      1890      1900      1910      1920
LEX151 GATGAGGGGCCTGACCAGGTGAAGACGGACGAGCGAGTCTTGCACACGGAGCGGGGGCTG
gi|897 GATGAGGGGCCTGACCAGGTGAAGACGGACGAGCGAGTCTTGCACACGGAGCGGGGGCTG
             1870      1880      1890      1900      1910      1920

             1930      1940      1950      1960      1970      1980
LEX151 CTGTTCCGCAGGCTTAGCCGTTTCGATGCGGGCACCTACACCTGCACCACTCTGGAGCAT
gi|897 CTGTTCCGCAGGCTTAGCCGTTTCGATGCGGGCACCTACACCTGCACCACTCTGGAGCAT
             1930      1940      1950      1960      1970      1980

             1990      2000      2010      2020      2030      2040
LEX151 GGCTTCTCCCAGACTGTGGTCCGCCTGGCTCTGGTGGTGATTGTGGCCTCACAGCTGGAC
gi|897 GGCTTCTCCCAGACTGTGGTCCGCCTGGCTCTGGTGGTGATTGTGGCCTCACAGCTGGAC
             1990      2000      2010      2020      2030      2040

             2050      2060      2070      2080      2090      2100
LEX151 AACCTGTTCCCTCCGGAGCCAAAGCCAGAGGAGCCCCCAGCCCGGGGAGGCCTGGCTTCC
gi|897 AACCTGTTCCCTCCGGAGCCAAAGCCAGAGGAGCCCCCAGCCCGGGGAGGCCTGGCTTCC
             2050      2060      2070      2080      2090      2100

             2110      2120      2130      2140      2150      2160
LEX151 ACCCCACCCAAGGCCTGGTACAAGGACATCCTGCAGCTCATTGGCTTCGCCAACCTGCCC
gi|897 ACCCCACCCAAGGCCTGGTACAAGGACATCCTGCAGCTCATTGGCTTCGCCAACCTGCCC
             2110      2120      2130      2140      2150      2160

             2170      2180      2190      2200      2210      2220
LEX151 CGGGTGGATGAGTACTGTGAGCGCGTGTGGTGCAGGGGCACCACGGAATGCTCAGGCTGC
gi|897 CGGGTGGATGAGTACTGTGAGCGCGTGTGGTGCAGGGGCACCACGGAATGCTCAGGCTGC
             2170      2180      2190      2200      2210      2220

             2230      2240      2250      2260      2270      2280
LEX151 TTCCGGAGCCGGAGCCGGGGCAAGCAGGCCAGGGGCAAGAGCTGGGCAGGGCTGGAGCTA
gi|897 TTCCGGAGCCGGAGCCGGGGCAAGCAGGCCAGGGGCAAGAGCTGGGCAGGGCTGGAGCTA
             2230      2240      2250      2260      2270      2280

             2290      2300      2310      2320      2330      2340
LEX151 GGCAAGAAGATGAAGAGCCGGGTGCATGCCGAGCACAATCGGACGCCCCGGGAGGTGGAG
gi|897 GGCAAGAAGATGAAGAGCCGGGTGCATGCCGAGCACAATCGGACGCCCCGGGAGGTGGAG
             2290      2300      2310      2320      2330      2340

LEX151 GCCACGTAG
gi|897 GCCACGTAGAAGGGGGCAGAGGAGGGGTGGT
             2350      2360

>>gi|8978201|dbj|AB029496.1| Homo sapiens mRNA for semap (4700 nt) 
rev-comp initn: 136 initl: 78 opt: 78 
85.714% identity in 21 nt overlap (875-855:476-496) 

900 890 880 .870 860 „ 

LEX15- GCACCACCAGGGCCGGGCACCGAGCAGACCAGCCTGGCCTTGAGGAAAGTGCTCCATTTG 

• *•"•••••• 

gi I 897 GCCACCGTGGGGAGCATGTGCTCCACCTGGAGCCTGGCAGTGTGGAAAGTGGCCGGGGGC 
' 460 470 480 490 500 



450 



840 830 820 810 800 790 

LEX15- TTCACCAGCACCCGCTGGCCCCCAGCATCATTCACGCAGACGCGGCCCACGCGGCTGACA. 

gi I 897 GGTGCCCTCACGAGCCCAGCCGTCCCTTTGCCAGCACCTTCATAGACGGGGAGCTGTACA 
y ' .520 530 540 550 560 



510 



2349 residues in 1 query sequences 

4700 residues in 1 library sequences 
Scomplib [version 3.3t05 March 30, 2000] 

start: Fri Sep 19 13:51:42 2003 done: Fri Sep 19 13:51:42 2003 
Scan time: 0.100 Display time: 0.150 

Function used was FASTA 



1: AB029496. Homo sapiens mRNA...[gi:8978201] 


LOCUS       AB029496     4700 bp    mRNA    linear   PRI 07-JUL-2000
DEFINITION  Homo sapiens mRNA for semaphorin sem2, complete cds.
ACCESSION   AB029496
VERSION     AB029496.1  GI:8978201
KEYWORDS    semaphorin sem2.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 4700)
  AUTHORS   Seki,N., Hattori,A., Hayashi,A., Kozuma,S., Muramatsu,M.,
            Miyajima,N. and Saito,T.
  TITLE     Human semaphorin
  JOURNAL   Published Only in DataBase (2000)
REFERENCE   2  (bases 1 to 4700)
  AUTHORS   Seki,N., Hattori,A., Hayashi,A., Kozuma,S., Muramatsu,M.,
            Miyajima,N. and Saito,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (01-JUL-1999) Toshiyuki Saito, National Institute of
            Radiological Sciences, Genome Research Group; Inage-ku Anagawa
            4-9-1, Chiba, Chiba 263-8555, Japan (E-mail:t_saito@nirs.go.jp,
            Tel:81-43-201-3135, Fax:81-43-251-9818)
FEATURES             Location/Qualifiers
     source          1..4700
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
     gene            1..4700
                     /gene="sem2"
     CDS             1..2349
                     /gene="sem2"
                     /codon_start=1
                     /product="semaphorin sem2"
                     /protein_id="BAA98132.1"
                     /db_xref="GI:8978202"
                     /translation="MAPSAWAICWLLGGLLLHGGSSGPSPGPSVPRLRLSYRDLLSAN
                     RSAIFLGPQGSLNLQAMYLDEYRDRLFLGGLDALYSLRLDQAWPDPREVLWPPQPGQR
                     EECVRKGRDPLTECANFVRVLQPHNRTHLLACGTGAFQPTCALITVGHRGEHVLHLEP
                     GSVESGRGRCPHEPSRPFASTFIDGELYTGLTADFLGREAMIFRSGGPRPALRSDSDQ
                     SLLHDPRFVMAARIPENSDQDNDKVYFFFSETVPSPDGGSNHVTVSRVGRVCVNDAGG
                     QRVLVNKWSTFLKARLVCSVPGPGGAETHFDQLEDVFLLWPKAGKSLEVYALFSTVSA
                     VFQGFAVCVYHMADIWEVFNGPFAHRDGPQHQWGPYGGKVPFPRPGVCPSKMTAQPGR
                     PFGSTKDYPDEVLQFARAHPLMFWPVRPRHGRPVLVKTHLAQQLHQIVVDRVEAEDGT
                     YDVIFLGTDSGSVLKVIALQAGGSAEPEEVVLEELQVFKVPTPITEMEISVKRQMLYV
                     GSRLGVAQLRLHQCETYGTACAECCLARDPYCAWDGASCTHYRPSLGKRRFRRQDIRH
                     GNPALQCLGQSQEEEAVGLVAATMVYGTEHNSTFLECLPKSPQAAVRWLLQRPGDEGP
                     DQVKTDERVLHTERGLLFRRLSRFDAGTYTCTTLEHGFSQTVVRLALVVIVASQLDNL
                     FPPEPKPEEPPARGGLASTPPKAWYKDILQLIGFANLPRVDEYCERVWCRGTTECSGC
                     FRSRSRGKQARGKSWAGLELGKKMKSRVHAEHNRTPREVEAT"

BASE COUNT 972 a 1307 c 1467 g 954 t 

ORIGIN 
1 atggccccct cggcctgggc catttgctgg ctgctagggg gcctcctgct ccatgggggt 
61 agctctggcc ccagccccgg ccccagtgtg ccccgcctgc ggctctccta ccgagacctc 
121 ctgtctgcca accgctctgc catctttctg ggcccccagg gctccctgaa cctccaggcc 
181 atgtacctag atgagtaccg agaccgcctc tttctgggtg gcctggacgc cctctactct 
241 ctgcggctgg accaggcatg gccagatccc cgggaggtcc tgtggccacc gcagccagga 
301 cagagggagg agtgtgttcg aaagggaaga gatcctttga cagagtgcgc caacttcgtg 
361 cgggtgctac agcctcacaa ccggacccac ctgctagcct gtggcactgg ggccttccag 
421 cccacctgtg ccctcatcac agttggccac cgtggggagc atgtgctcca cctggagcct 
481 ggcagtgtgg aaagtggccg ggggcggtgc cctcacgagc ccagccgtcc ctttgccagc 
541 accttcatag acggggagct gtacacgggt ctcactgctg acttcctggg gcgagaggcc 
601 atgatcttcc gaagtggagg tcctcggcca gctctgcgtt ccgactctga ccagagtctc 
661 ttgcacgacc cccggtttgt gatggccgcc cggatccctg agaactctga ccaggacaat 
721 gacaaggtgt acttcttctt ctcggagacg gtcccctcgc ccgatggtgg ctcgaaccat 
781 gtcactgtca gccgcgtggg ccgcgtctgc gtgaatgatg ctgggggcca gcgggtgctg 
841 gtgaacaaat ggagcacttt cctcaaggcc aggctggtct gctcggtgcc cggccctggt 
901 ggtgccgaga cccactttga ccagctagag gatgtgttcc tgctgtggcc caaggccggg 
961 aagagcctcg aggtgtacgc gctgttcagc accgtcagtg ccgtgttcca gggcttcgcc 
1021 gtctgtgtgt accacatggc agacatctgg gaggttttca acgggccctt tgcccaccga 
1081 gatgggcctc agcaccagtg ggggccctat gggggcaagg tgcccttccc tcgccctggc 
1141 gtgtgcccca gcaagatgac cgcacagcca ggacggcctt ttggcagcac caaggactac 
1201 ccagatgagg tgctgcagtt tgcccgagcc caccccctca tgttctggcc tgtgcggcct 
1261 cgacatggcc gccctgtcct tgtcaagacc cacctggccc agcagctaca ccagatcgtg 
1321 gtggaccgcg tggaggcaga ggatgggacc tacgatgtca ttttcctggg gactgactca 
1381 gggtctgtgc tcaaagtcat cgctctccag gcagggggct cagctgaacc tgaggaagtg 
1441 gttctggagg agctccaggt gtttaaggtg ccaacaccta tcaccgaaat ggagatctct 
1501 gtcaaaaggc aaatgctata cgtgggctct cggctgggtg tggcccagct gcggctgcac 
1561 caatgtgaga cttacggcac tgcctgtgca gagtgctgcc tggcccggga cccatactgt 
1621 gcctgggatg gtgcctcctg tacccactac cgccccagcc ttggcaagcg ccggttccgc 
1681 cggcaggaca tccggcacgg caaccctgcc ctgcagtgcc tgggccagag ccaggaagaa 
1741 gaggcagtgg gacttgtggc agccaccatg gtctacggca cggagcacaa tagcaccttc 
1801 ctggagtgcc tgcccaagtc tccccaggct gctgtgcgct ggctcttgca gaggccaggg 
1861 gatgaggggc ctgaccaggt gaagacggac gagcgagtct tgcacacgga gcgggggctg 
1921 ctgttccgca ggcttagccg tttcgatgcg ggcacctaca cctgcaccac tctggagcat 
1981 ggcttctccc agactgtggt ccgcctggct ctggtggtga ttgtggcctc acagctggac 
2041 aacctgttcc ctccggagcc aaagccagag gagcccccag cccggggagg cctggcttcc 
2101 accccaccca aggcctggta caaggacatc ctgcagctca ttggcttcgc caacctgccc 
2161 cgggtggatg agtactgtga gcgcgtgtgg tgcaggggca ccacggaatg ctcaggctgc 
2221 ttccggagcc ggagccgggg caagcaggcc aggggcaaga gctgggcagg gctggagcta 
2281 ggcaagaaga tgaagagccg ggtgcatgcc gagcacaatc ggacgccccg ggaggtggag 
2341 gccacgtaga agggggcaga ggaggggtgg tcaggatggg ctggggggcc cactagcagc 
2401 ccccagcatc tcccacccac ccagctaggg cagaggggtc aggatgtcfcg tttgcctctt 
2461 agagacaggt gtctctgccc ccacaccgct actggggtct aatggagggg ctgggttctt 
2521 gaagcctgtt ccctgccctt ctctgtgctc ttagacccag ctggagccag caccctctgg 
2581 ctgctggcag ccccaaggga tctgccattt gttctcagag atggcctggc ttccgcaaca 
2641 catttccggg tgtgcccaga ggcaagaggg ttgggtggtt ctttcccagc ctacagaaca 
2701 atggccattc tgagtgaccc tcagagtggg tgtgtgggtg cgtctagggg gtatcccggt 
2761 agggggcctg cagggagcca gagggtggaa atggcctcta agctagcacc ccgtaagaag 
2821 agcctacctg accgacttgg ggagggaaca cagaggtgtt gggaaggtgg agcaacaatg 
2881 cacctcccct cctgtcgcgc cgtgatatct tggtggctcc ctgccactgc ccaccgcctc 
2941 ttctccatct gagaatcacg gagaggtgta gataatctag aggcatagac tgctagagcc 
3001 cccagggatc tggggtggtc agggctcagg cttcactttg taaaccaggt gggggcatct 
3061 cacagcctga cttcccttcc ccaggccagg gttgctggga tgcctgcccc tcctgagagg 
3121 accccctccc cattgtcagg ctctccatgt ccacgagcgg ggaggggtgg gttctggggc 
3181 attgttgtcc cttgtgtctg tggactagag atagggtggg ggagctgggg aagggtgcag 
3241 gcgggaagag tgggctgtct ttcccagggt gatgcaagca tgccgcagcc ctggaggctg 
3301 ggaatgtgga ggctctgtga gccctgcagc cctcagaatc agggccaggg atgcagaaga 
3361 ttgagaggat atggagatgg atagagggca ggagaccctt aggatagatt gtgggaccca 
3421 ggcaggaaca ggtgtccaca agaactcagg atggcatcag ttagctcaga agccacctgg 
3481 aagacccagt gtttccatct ctggaatctc tgttttatgc taaatggatt taggaagact 
3541 gtttttgttt taagggggaa acaaggtaga gaaaaggacg aagaagtgta agtcccgctg 
3601 attctcgggg gtaaggctcg gatggcaagg acgcgttctg cctgggcatg taggggaggt 
3661 gtttttgcca tcaccagttt ctcaggctgg ggagcacaga ggggaggagg aggactaaat 
3721 gaaaagttgt tcccagcctg cacatgaaca cattcatgac acacaaaact ggctggaagg 
3781 agataagagc actgggtttg agattccctc cattaaaaca accaagacaa agaaaggagg 
3841 ggaaaaaaag ataaaaagca agccagggtt ccctgcccta ttgaaactca aacccagact 
3901 gccttgggtt ttatctttcc cttacccctg gcacctccag agaactggga cctgaaatag 
3961 tccctccgtt ctcccctttg accatgtaat aaatgaacca gaagcactga gattaaccta 
4021 tcaacgccct gagaagcctt ccagcctgcg gtgctgtctg ctgggaggtc agctggtcaa 
4081 ggcagaggag gagaggagga aaggatgggg gctgaagagc agaagggagg ggagacagag 
4141 gggattaaag aggggaggag agagtgcaga gctccaggaa agggtatcag agctgcagcc 
4201 agctctgccc tctaccctag ggaggccaga aagacacaaa cagccctccg ggcctttacg 
4261 ctggactctg gcttggcagg ctccaggcag ggtcctctgg gaagttactc tagaaaacga 
4321 agggaggagg agcacaagat cctcagcaac gaacacctgc acttagaaaa agtggacagc 
4381 ttctgccaac cacaccctac ccatggtact gtatgctatt aactcctgga aacgccccgt 
4441 aaatgcgagt tgtttttgta tttgtgtgtt gagatgggcc ttgtggtttc tctgtactca 
4501 gagcacattt cttgtaatta ctattgttat ttttattgtc atgactgccc ctgagctctg 
4561 gtgagaaaag ctgaatttac aaggaaaggg atgaagttaa tatttgcatc acataattat 
4621 atcattactg tgtatctgtg tattgtacta aatggactga tgctgcgcac atgagctgaa 
4681 aatgaagagc cctcccatcc 

FASTA searches a protein or DNA sequence data bank 
version 3.3t05 March 30, 2000 
Please cite: 

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448 

/tmp/fastaGAALlaqDv: 2628 nt 

>LEX151 SEQ ID NO:1 

vs /tmp/fastaHAAMlaqDv library 
searching /tmp/fastaHAAMlaqDv library 

4700 residues in 1 sequences 

FASTA (3.34 January 2000) function [optimized, +5/-4 matrix (5:-4)] ktup: 6 

join: 77, opt: 62, gap-pen: -16/ -4, width: 16 

Scan time: 0.117 
The best scores are: opt 
gi|8978201|dbj|AB029496.1| Homo sapiens mRNA f (4700) [f] 11742 
gi|8978201|dbj|AB029496.1| Homo sapiens mRNA f (4700) [r] 95 

>>gi|8978201|dbj|AB029496.1| Homo sapiens mRNA for semap (4700 nt) 
initn: 11742 init1: 11742 opt: 11742 

99.957% identity in 2349 nt overlap (280-2628:1-2349) 

250 260 270 280 290 300 

LEXl 5 1 AGGCGGCAGCGGTGCCCTCAGTTCCCCAGCATGGCCCCCTCGGCCTGGGCC ATTTGCTGG 

gi I 8 9 7 ATGGCCCCCTCGGCCTGGGCCATTTGCTGG 

10 20 / 30 

310 320 330 340 350 360 

LEXl 5 1 CTGCTAGGGGGCCTCCTGCTCCATGGGGGTAGCTCTGGCCCCAGCCCCGGCCCCAGTGTG 

::::::::::::::::::::::::::::::::::::::::::::: — :: — : — — — — 
gi I 897 CTGCTAGGGGGCCTCCTGCTCCATGGGGGTAGCTCTGGCCCCAGCCCCGGCCCCAGTGTG 
40 50 60 70 80 90 

370 380 390 400 410 420 

LEXl 51 CCCCGCCTGCGGCTCTCCTACCGAGACCTCCTGTCTGCCAACCGCTCTGCCATCTTTCTG 

::::::::::::::::::::::::::::::::::::::::::::::: — — ::• — •: — 
gi I 897 CCCCGCCTGCGGCTCTCCTACCGAGACCTCCTGTCTGCCAACCGCTCTGCCATCTTTCTG- 
100 . 110 120 130 140 150 

430 440 450 460 470 480 

LEXl 51 GGCCCCCAGGGCTCCCTGAACCTCCAGGCCATGTACGTAGATGAGTACCGAGACCGCCTC 

::::;::::::::::::::::::::::::::::::::::::::::?:•:: — — : — 5 5 : 
gi I 897 GGCCCCCAGGGCTCCCTGAACCTCCAGGCCATGTACCTAGATGAGTACCGAGACCGCCTC 
160 170 180 190 200 210 



490 500 510 520 530 540 

LEXl 5 1 TTTCTGGGTGGCCTGGACGCCCTCTACTCTCTGCGGCTGGACCAGGC ATGGCCAGATCCC 

:::::::::::::::::::::::::::::::::::::::::::::: — — :: — : — : — 
gi|897 TTTCTGGGTGGCCTGGACGCCCTCTACTCTCTGCGGCTGGACCAGGCATGGCCAGATCCC 
220 230 240 250 260 270 

550 560 570 580 590 600 

LEXl 5 1 CGGGAGGTCCTGTGGCCACCGCAGCCAGGACAGAGGGAGGAGTGTGTTCGAAAGGGAAGA 



gi I 897 CGGGAGGTCCTGTGGCCACCGCAGCCAGGACAGAGGGAGGAGTGTGTTCGAAAGGGAAGA 
280 290 300 310 320 330 

610 620 630 640 650 660 

LEX151 GATCCTTTGACAGAGTGCGCCAACTTCGTGCGGGTGCTACAGCCTCACAACCGGACCCAC 

gi|897 GATCCTTTGACAGAGTGCGCCAACTTCGTGCGGGTGCTACAGCCTCACAACCGGACCCAC 

340 350 360 370 380 390 

670 680 690 700 710 720 

LEXl 5,1 CTGCTAGCCTGTGGCACTGGGGCCTTCC AGCCC ACCTGTGCCCTC ATCACAGTTGGCC AC 

gi 1 897 CTCCTAGCCTGTGGCACTGGGGCCTTCC^^ 

400 410 420 430 440 450 

730 740 .750 . 760 770 780 

■ LEXISI CGTGGGGAGCATGTGCTCCACCTGGAGCCTGGCAGTGTGGAAAGTGGCCGGGGGCGGTGC 

gi 1897 CGTGGGGAGCATGTGCTCCACCTGGAGCCTGGCAGTGTGGAAAGTGGCCGGGGGCGGTGC 
460 470 480 490 500 510 

790 800 810 820 830 840 

LEXl 5 1 CCTCACGAGCCCAGCCGTCCCTTTGCCAGCACCTTCATAGACGGGGAGCTGTACACGGGT 

gi 1 897 CCTCACGAGCCCAGCCGTCCCTTTGCCAGCACCT^ 

520 530 540 550 560 570 

850 860 870 880 890 900 

LEX151 CTCACTGCTGACTTCCTGGGGCGAGAGGCCATGATCTTCCGAAGTGGAGGTCCTCGGCCA 

:::::::::::::::::::::::::::::: 5 : 5 5 • 5 •••• 

gi 1 897 CTCACTGCTGACTTCCTGGGGCGAGAGGCCATGATCTTCCGAAGTGGAGGTCCTCGGCCA 

580 590 600 610 620 . / . 630 

910 920 930 940 950 960 

LEXl 5 1 GCTCTGCGTTCCGACTCTGACCAGAGTCTCTTGCACGACCCCCGGTTTGTGATGGCCGCC 

::::::::::::::::::::::::::::::::::: — " — -" = = " = " = = " = = = • 
gi I 897 GCTCTGCGTTCCGACTCTGACCAGAGTCTCTTGCACGACCCCCGGTTTGTGATGGCCGCC 

640 650 660 670 680 690 

970 980 990 1000 1010 1020 

LEXl 5 1 CGGATCCCTGAGAACTCTGACCAGGACAATGACAAGGTGTACTTCTTCTTCTCGGAGACG 

:::::::::::::::::::::::::::::::::: 5 2 : 5 — 5 5 = • = • = = ' 
gi 1 897 CGGATCCCTGAGAACTCTGACCAGGACAATGACAAGGTGTACTTCTTCTTCTCGGAGACG 

700- 710 720 730 740 750 

1030 1040 1050 1060 1070 1080 

LEXl 5 1 GTCCCCTCGCCCGATGGTGGCTCGAACCATGTCACTGTC AGCCGCGTGGGCCGCGTGTGC 

gi I 897 GTCCCCTCGCCCGATGGTGGCTCGAACCATGTCACTGTCAGCCGCGTGGGCCGCGTCTGC 
760 770 780 790 800 810 

1090 1100 1110 1120 1130 1140 

LEXl 5 1 GTGAATGATGCTGGGGGCCAGCGGGTGCTGGTGAACAAATGGAGCACTTTCCTC AAGGCC 

gi 1 8 9 7 GTGAATGATGCTGGGGGCCAGCGGGTGCTGGTGAACAAATGGAGCACTTTCCTCAAGGCC 
820 830 840 850 860 870 

1150 1160 1170 1180 1190 1200 

LEX151 AGGCTGGTCTGCTCGGTGCCCGGCCCTGGTGGTGCCGAGACCCACTTTGACCAGCTAGAG 

• 

gi 1 897 AGGCTGGTCTGCTCGGTGCCCGGCCCTGGTGGTGCCGAGACCCACTTTGACCAGCTAGAG 
880 890 900 910 92p 930 

1210 1220 1230 1240 1250 1260 

LEX151 GATGTGTTCCTGCTGTGGCCCAAGGCCGGGAAGAGCCTCGAGGTGTACGCGCTGTTCAGC 

gi|897 GATGTGTTCCTGCTGTGGCCCAAGGCCGGGAAGAGCCTCGAGGTGTACGCGCTGTTCAGC 
940 950 960 970 980 990 

1270 1280 1290 1300 . 1310 1320 

LEX151 ACCGTCAGTGCCGTGTTCCAGGGCTTCGCCGTCTGTGTGTACCACATGGCAGACATCTGG 

gi I 897 ACCGTCAGTGCCGTGTTCCAGGGCTTCGCCGTCTGTGTGTACCACATGGCAGACATCTGG 
1000 1010 1020 1030 1040 1050 

1330 1340 1350. 1360 1370 1380 

LEXl 5 1 GAGGTTTTCAACGGGCCCTTTGCCCACCGAGATGGGCCTCAGCACCAGTGGGGGCCCTAT 

::::::::::::::::::::::::::::::::::::::::: ^ :::::: 2 = : — • 
gi I 897 GAGGTTTTCAACGGGCCCTTTGCCCACCGAGATGGGCCTCAGCACCAGTGGGGGCCCTAT 
1060 1070 1080 1090 1100 1110 

1390 1400 1410 1420 1430 1440 

LEXl 5 1 GGGGGCAAGGTGCCCTTCCCTCGCCCTGGCGTGTGCCCCAGCAAGATGACCGCAC AGCCA 



gi I 897 GGGGGCAAGGTGCCCTTCCCTCGCCCTGGCGTGTGCCCCAGCAAGATGACCGCACAGCCA 
1120 1130 1140 1150 1160 1170 

1450 1460 1470 1480 1490 1500 

LEX151 GGACGGCCTTTTGGCAGCACCAAGGACTACCCAGATGAGGTGCTGCAGTTTGCCCGAGCC 



gi I 897 GGACGGCCTTTTGGCAGCACCAAGGACTACCCAGATGAGGTGCTGCAGTTTGCCCGAGCC 
1180 1190 1200 1210 1220 1230 

1510 1520 1530 1540 1550 1560 

LEXl 5 1 CACCCCCTCATGTTCTGGCCTGTGCGGCCTCGACATGGCCGCCCTGTCCTTGTCAAGACC 

:::::::::::::::::::::::::::::::::::::::::: J ::::: i J : = J — : — 
gi I 897 CACCCCCTCATGTTCTGGCCTGTGCGGCCTCGACATGGCCGCCCTGTCCTTGTCAAGACC 
1240 1250 1260 1270 1280 1290 

1570 1580 1590 1600 1610 1620 

LEXl 5 1 CACCTGGCCCAGCAGCTACACCAGATCGTGGTGGACCGCGTGGAGGCAGAGGATGGGACC 

gi I 897 CACCTGGCCCAGCAGCTACACCAGATCGTGGTGGACCGCGTGGAGGCAGAGGATGGGACC 
1300. 1310 1320 1330 1340 1350 

1630 1640 1650 1660 .1670 1680 

LEXl 51 TACGATGTCATTTTCCTGGGGACTGACTCAGGGTCTGTGCTCAAAGTCATCGCTCTCCAG 

gi | 897 TACGATGTCATTTTCCTGGGGACTGACTCAGGGTCTGTGCTCAAAGTCATCGCTCTCCAG 
1360 1370 1380 1390 1400 1410 

. 1690 1700 1710 1720 1730 1740 

. LEX151 GCAGGGGGCTCAGCTGAACCTGAGGAAGTGGTTCTGGAGGAGCTCCAGGTGTTTAAGGTG 
:::::::::::::::::::::::::::::::::::::::::::::: - ::::::::::: 
gi|897 GCAGGGGGCTCAGCTGAACCTGAGGAAGTGGTTCTGGAGGAGCTCCAGGTGTTTAAGGTG 
1420 1430 1440 1450 1460 1470 

1750 1760 1770 1780 1790 1800 . 

LEXl 5 1 CCAACACCTATCACCGAAATGGAGATCTCTGTCAAAAGGCAAATGCTATACGTGGGCTCT 

gi I 897 CCAACACCTATCACCGAAATGGAGATCTCTGTCAAAAGGCAAATGCTATACGTGGGCTCT 
1480 1490 1500 1510 1520 1530 

1810 1820 1830 1840 1850 1860 

LEX151 CGGCTGGGTGTGGCCCAGCTGCGGCTGCACCAATGTGAGACTTACGGCACTGCCTGTGCA 



gi I 897 CGGCTGGGTGTGGCCCAGCTGCGGCTGCACCAATGTGAGACTTACGGCACTGCCTGTGCA 
1540 1550 1560 . 1570 1580 1590 

1870 1880 1890 1900 . 1910 1920 

LEX151 GAGTGCTGCCTGGCCCGGGACCCATACTGTGCCTGGGATGGTGCCTCCTGTACCCACTAC 



gi I 897 GAGTGCTGCCTGGCCCGGGACCCATACTGTGCCTGGGATGGTGCCTCCTGTACCCACTAC 
1600 1610 1620 1630 1640 1650 

1930 1940 . 1950 1960 1970 1980 . 

LEXl 5 1 CGCCCCAGCCTTGGCAAGCGCCGGTTCCGCCGGCAGGACATCCGGCACGGCAACCCTGCC 

■••••••••■•■••••••«••••••••*•«••••••••>••••••••■•••••••••■■•' 

■•>■••>>••*•■••••••••••■.••»•••••■•••••••••■•••••••••••■••••• 

gi I 897 CGCCCCAGCCTTGGCAAGCGCCGGTTCCGCCGGCAGGACATCCGGCACGGCAACCCTGCC 

1660 1670 1680 1690 ' 1700 1710 

1990 2000 2010 2020 2030 2040 

LEXl 5 1 CTGCAGTGCCTGGGCCAGAGCCAGGAAGAAGAGGCAGTGGGACTTGTGGCAGCCACCATG 



gi I 897 CTGCAGTGCCTGGGCCAGAGCCAGGAAGAAGAGGCAGTGGGACTTGTGGCAGCCACCATG 
1720 1730 1740 1750 1760 1770 

2050 2060 2070 2080 2090 2100 

LEX15 1 GTCTACGGCACGGAGCACAATAGCACCTTCCTGGAGTGCCTGCCCAAGTCTCCCCARGCT 



gi I 897 GTCTACGGCACGGAGCACAATAGCACCTTCCTGGAGTGCCTGCCCAAGTCTCCCCAGGCT 
1780 1790 1800 .1810 1820 / 1830 

2110 2120 2130 2140 2150 2160 

LEXl 5 1 GCTGTGCGCTGGCTCTTGCAGAGGCCAGGGGATGAGGGGCCTGACCAGGTGAAGACGGAC 



gi I 897 GCTGTGCGCTGGCTCTTGCAGAGGCCAGGGGATGAGGGGCCTGACCAGGTGAAGACGGAC 
1840 1850 1860 1870 1880 1890 

2170 2180 2190 2200 2210 2220 

LEX151 GAGCGAGTCTTGCACACGGAGCGGGGGCTGCTGTTCCGCAGGCTTAGCCGTTTCGATGCG

gi|897 GAGCGAGTCTTGCACACGGAGCGGGGGCTGCTGTTCCGCAGGCTTAGCCGTTTCGATGCG
1900 1910 1920 1930 1940 1950

2230 2240 2250 2260 2270 2280 

LEX151 GGCACCTACACCTGCACCACTCTGGAGCATGGCTTCTCCCAGACTGTGGTCCGCCTGGCT

gi|897 GGCACCTACACCTGCACCACTCTGGAGCATGGCTTCTCCCAGACTGTGGTCCGCCTGGCT
1960 1970 1980 1990 2000 2010 

2290 2300 2310 2320 2330 2340 

LEX151 CTGGTGGTGATTGTGGCCTCACAGCTGGACAACCTGTTCCCTCCGGAGCCAAAGCCAGAG
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
gi|897 CTGGTGGTGATTGTGGCCTCACAGCTGGACAACCTGTTCCCTCCGGAGCCAAAGCCAGAG
2020 2030 2040 2050 2060 2070 

2350 2360 2370 2380 2390 2400 

LEX151 GAGCCCCCAGCCCGGGGAGGCCTGGCTTCCACCCCACCCAAGGCCTGGTACAAGGACATC
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
gi|897 GAGCCCCCAGCCCGGGGAGGCCTGGCTTCCACCCCACCCAAGGCCTGGTACAAGGACATC

2080 2090 2100 2110 2120 2130 



2410 2420 2430 2440 2450 2460

LEX151 CTGCAGCTCATTGGCTTCGCCAACCTGCCCCGGGTGGATGAGTACTGTGAGCGCGTGTGG
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
gi|897 CTGCAGCTCATTGGCTTCGCCAACCTGCCCCGGGTGGATGAGTACTGTGAGCGCGTGTGG
2140 2150 2160 2170 2180 2190 

2470 2480 2490 2500 2510 2520 

LEX151 TGCAGGGGGACCACGGAATGCTCAGGCTGCTTCCGGAGCCGGAGCCGGGGCAAGCAGGCC
:::::::: :::::::::::::::::::::::::::::::::::::::::::::::::::
gi|897 TGCAGGGGCACCACGGAATGCTCAGGCTGCTTCCGGAGCCGGAGCCGGGGCAAGCAGGCC
2200 2210 2220 2230 2240 2250 

2530 2540 2550 2560 2570.. 2580 

LEX151 AGGGGCAAGAGCTGGGCAGGGCTGGAGCTAGGCAAGAAGATGAAGAGCCGGGTGCATGCC
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
gi|897 AGGGGCAAGAGCTGGGCAGGGCTGGAGCTAGGCAAGAAGATGAAGAGCCGGGTGCATGCC

2260 2270 2280 2290 2300 2310 

2590 2600 2610 2620 

LEX151 GAGCACAATCGGACGCCCCGGGAGGTGGAGGCCACGTAG
:::::::::::::::::::::::::::::::::::::::
gi|897 GAGCACAATCGGACGCCCCGGGAGGTGGAGGCCACGTAGAAGGGGGCAGAGGAGGGGTGG

2320 2330 2340 2350 2360 2370 

gi|897 TCAGGATGGGCTGGGGGGCCCACTAGCAGCCCCCAGCATCTCCCACCCACCCAGCTAGGG
2380 2390 2400 2410 2420 2430 

>>gi|8978201|dbj|AB029496.1| Homo sapiens mRNA for semap (4700 nt)
rev-comp initn: 83 init1: 83 opt: 95
67.105% identity in 76 nt overlap (119-46:407-475)

150 140 130 120 110 100 

LEX15- GCGGGAAGAGGGGCGGAGGAGAGAAGGAGGCTGGGGCCTTGCCGTCCACCTGCCGCTTCT

gi|897 ACAACCGGACCCACCTGCTAGCCTGTGGCACTGGGGCCTTCCAGCCCACCTGTGCC--CT
380 390 400 410 420 430 

90 80 70 60 50 40 

LEX15- CCTTCCACCTTGTTGGCC-CAGTGCAG-GCTTTTGTGCCACACTGGCCAGCTCCCCATTG 

; • ••••••• • • • •••• •••• 

• • ••••••• • ••• • « •••• •••• 

gi|897 CATCACA GTTGGCCACCGTGGGGAGCATGTGCTCCAC-CTGGAGCCTGGCAGTGTG 

440 450 460 470 480 

30 20 10 

LEX15- GGAAGACCTTCCCAGCTAGGGCACAGGCCAT

gi|897 GAAAGTGGCCGGGGGCGGTGCCCTCACGAGCCCAGCCGTCCCTTTGCCAGCACCTTCATA
490 500 510 520 530 540 



2628 residues in 1 query sequences 
4700 residues in 1 library sequences 
Scomplib [version 3.3t05 March 30, 2000] 

start: Fri Sep 19 13:50:44 2003 done: Fri Sep 19 13:50:45 2003 
Scan time: 0.117 Display time: 0.133 

Function used was FASTA
FASTA searches a protein or DNA sequence data bank

version 3.3t05 March 30, 2000
Please cite: 

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448 

/tmp/fastaGAAvaaG8P: 781 aa 

vs. /tmp/fastaHAAwaaG8P library
searching /tmp/fastaHAAwaaG8P library

814 residues in 1 sequences 

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15: -5)] ktup: 2 

join: 38, opt: 26, gap-pen: -12/ -2, width: 16 

Scan time: 0.034
The best scores are: opt
SEQ ID NO:1 human semaphorin MACALAGKVFPMGSWPVWHK ( 814) 5462

>>SEQ ID NO:1 human semaphorin (814 aa)

initn: 5462 init1: 5462 opt: 5462
Smith-Waterman score: 5462; 100.000% identity in 781 aa overlap (1-781:34-814) 

10 20 30 

SEQ MAPSAWAICWLLGGLLLHGGSSGPSPGPSV 



SEQ GGSRANYNRRPAGPEGGSAGRRQRCPQFPSMAPSAWAICWLLGGLLLHGGSSGPSPGPSV 
10 20 30 40 50 60 

40 50 60 70 80 90

SEQ PRLRLSYRDLLSANRSAIFLGPQGSLNLQAMYLDEYRDRLFLGGLDALYSLRLDQAWPDP 



SEQ PRLRLSYRDLLSANRSAIFLGPQGSLNLQAMYLDEYRDRLFLGGLDALYSLRLDQAWPDP 
70 80 90 100 110 120 

100 110 120 130 140 150 

SEQ REVLWPPQPGQREECVRKGRDPLTECANFVRVLQPHNRTHLLACGTGAFQPTCALITVGH

SEQ REVLWPPQPGQREECVRKGRDPLTECANFVRVLQPHNRTHLLACGTGAFQPTCALITVGH
130 140 150 160 170 180 

160 170 180 190 200 210 

SEQ RGEHVLHLEPGSVESGRGRCPHEPSRPFASTFIDGELYTGLTADFLGREAMIFRSGGPRP 



SEQ RGEHVLHLEPGSVESGRGRCPHEPSRPFASTFIDGELYTGLTADFLGREAMIFRSGGPRP 
190 200 210 220 230 240 

220 230 240 250 260 . 270 

SEQ ALRSDSDQSLLHDPRFVMAARIPENSDQDNDKVYFFFSETVPSPDGGSNHVTVSRVGRVC 



SEQ ALRSDSDQSLLHDPRFVMAARIPENSDQDNDKVYFFFSETVPSPDGGSNHVTVSRVGRVC 
250 260 270 280 290 300 

280 290 300 310 320 330 

SEQ VNDAGGQRVLVNKWSTFLKARLVCSVPGPGGAETHFDQLEDVFLLWPKAGKSLEVYALFS 

SEQ VNDAGGQRVLVNKWSTFLKARLVCSVPGPGGAETHFDQLEDVFLLWPKAGKSLEVYALFS
310 320 330 340 350 360 

340 350 360 370 380 390 

SEQ TVSAVFQGFAVCVYHMADIWEVFNGPFAHRDGPQHQWGPYGGKVPFPRPGVCPSKMTAQP 



SEQ TVSAVFQGFAVCVYHMADIWEVFNGPFAHRDGPQHQWGPYGGKVPFPRPGVCPSKMTAQP
370 380 390 400 410 420 

400 410 420 430 440 450 

SEQ GRPFGSTKDYPDEVLQFARAHPLMFWPVRPRHGRPVLVKTHLAQQLHQIWDRVEAEDGT 

SEQ GRPFGSTKDYPDEVLQFARAHPLMFWPVRPRHGRPVLVKTHLAQQLHQIWDRVEAEDGT 
430 440. 450 460 470 480 

460 470 480 490 500 . 510 

SEQ YDVIFLGTDSGSVLKVIALQAGGSAEPEEWLEELQVFKVPTPITEMEISVKRQMLYVGS 

SEQ VDVIFLGTDSGSVLKVIALQAGGSAEPEEWLEELQVFKVPTPITEMEISVKRQMLYVGS 
490 500 510 520 530 540 

520 530 540 550 560 570 

SEQ RLGVAQLRLHQCETYGTACAECCLARDPYCAWDGASCTHYRPSLGKRRFRRQDIRHGNPA 

SEQ RLGVAQLRLHQCETYGTACAECCLARDPYCAWDGASCTHYRPSLGKRRFRRQDIRHGNPA 
550 560 570 580 590 600 

580 590 600 610 620 630 

SEQ LQCLGQSQEEEAVGLVAATMVYGTEHNSTFLECLPKSPAAVRWLLQRPGDEGPDQVKTDE 

SEQ LQCLGQSQEEEAVGLVAATMVYGTEHNSTFLECLPKSPAAVRWLLQRPGDEGPDQVKTDE 
610 620 630 640 650 660 

640 650 660 670 680 690

SEQ RVLHTERGLLFRRLSRFDAGTYTCTTLEHGFSQTVVRLALVVIVASQLDNLFPPEPKPEE

SEQ RVLHTERGLLFRRLSRFDAGTYTCTTLEHGFSQTVVRLALVVIVASQLDNLFPPEPKPEE 
670 680 690 700 710 720 

700 710 720 730 740 750 

SEQ PPARGGLASTPPKAWYKDILQLIGFANLPRVDEYCERVWCRGTTECSGCFRSRSRGKQAR 

SEQ PPARGGLASTPPKAWYKDILQLIGFANLPRVDEYCERVWCRGTTECSGCFRSRSRGKQAR 
730 740 750 760 770 780 

760 770 780 

SEQ GKSWAGLELGKKMKSRVHAEHNRTPREVEAT

SEQ GKSWAGLELGKKMKSRVHAEHNRTPREVEAT
790 800 810 



781 residues in 1 query sequences 
814 residues in 1 library sequences 
Scomplib [version 3.3t05 March 30, 2000] 

start: Mon Feb 10 16:26:11 2003 done: Mon Feb 10 16:26:12 2003 
Scan time: 0.034 Display time: 0.866 

Function used was FASTA 
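
The identity percentages reported in the FASTA runs above can be reproduced from any pair of gapped alignment rows. For example, 67.105% identity in a 76 nt overlap corresponds to 51 identical columns out of 76. The following is a minimal Python sketch, assuming that identity is counted as identical columns divided by all aligned columns, gaps included. The function name and the toy sequences are hypothetical illustrations and are not taken from the FASTA program itself.

```python
def percent_identity(aligned_query: str, aligned_subject: str) -> float:
    """Percent identity of two equal-length, gapped alignment rows.

    Assumption: identity = identical columns / all aligned columns.
    Gap characters ('-') count toward the length but never as matches.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("alignment rows must have the same length")
    if not aligned_query:
        raise ValueError("alignment rows must not be empty")
    matches = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


# Hypothetical toy alignment: 4 identical columns out of 6.
toy = percent_identity("ACGT-A", "ACCTTA")

# 51 identical columns in a 76-column overlap gives the reported 67.105%.
reported = round(100.0 * 51 / 76, 3)
```

Applied to the two rows of a real alignment block, the same function would return the figure printed in the corresponding FASTA summary line, provided the program uses the same denominator.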




Axons are guided along specific pathways by attractive and repulsive cues
in the extracellular environment. Genetic and biochemical studies have led
to the identification of highly conserved families of guidance molecules, 
including netrins, Slits, semaphorins, and ephrins. Guidance cues steer 
axons by regulating cytoskeletal dynamics in the growth cone through 
signaling pathways that are still only poorly understood. Elaborate regu- 
latory mechanisms ensure that a given cue elicits the right response from 
the right axons at the right time but is otherwise ignored. With such 
regulatory mechanisms in place, a relatively small number of guidance 
factors can be used to generate intricate patterns of neuronal wiring. 



The correct wiring of the nervous system
relies on the uncanny ability of axons and 
dendrites to locate and recognize their appro- 
priate synaptic partners. To help them find 
their way in the developing embryo, axons
and dendrites are tipped with a highly motile 
and exquisitely sensitive structure, the 
growth cone. Extracellular guidance cues can 
either attract or repel growth cones, and can 
operate either at close range or over a dis-
tance (1). By responding to the appropriate
set of cues, growth cones are able to select the 
correct path toward their target.
Ten years ago (2), very few of the molecules
that guide axons in vivo were known. But the
1970s and '80s had seen the introduction of
several powerful in vitro assays to detect guid-
ance activities in the developing vertebrate ner-
vous system, and the growing interest of inver-
tebrate geneticists in the problem of axon guid-
ance. So by the early 1990s, the stage had been
set for a burst of activity that led to the discov-
ery of several conserved families of axon guid-
ance molecules. Prominent among these are the
netrins, Slits, semaphorins, and ephrins (Fig. 1).
These are not the only known guidance mole-
cules, but they are by far the best understood.
With these molecules in hand, we can now 
begin to ask how growth cones sense and re- 
spond to guidance cues, and how a relatively 
small number of cues can be used to assemble 
complex neuronal networks. 

Guidance Cues and Their Receptors 

Netrins. The discovery of netrins came as the
remarkable convergence of the search for a 
chemoattractant for vertebrate commissural 
axons (3, 4), and the analysis of genes re-
quired for circumferential axon guidance in
Caenorhabditis elegans (5, 6). Across more
than 600 million years of evolution, netrins 
have retained the function of attracting axons 
ventrally toward the midline (7). Netrins can
also repel some axons, and this function too 
has been conserved. This was initially in-
ferred from defects in dorsal as well as ven-
tral guidance in unc-6/netrin mutant worms
(i), and subsequently confirmed by the direct
demonstration of netrin's repulsive activity in
vertebrates (5) and in flies (9, 10).

Identification of the netrin receptors fol- 
lowed from the characterization of two other 
worm mutants with defects in circumferential 
guidance: unc-40, which primarily disrupts 



ventral guidance; and unc-5, which affects only
dorsal guidance (5). Both unc-40 and unc-5
encode conserved transmembrane proteins (7), 
with UNC-40 belonging to the DCC (deleted in 
colorectal carcinoma) family. Biochemical and 
genetic studies have confirmed their functions 
as netrin receptors in several different species 
(7, 10). DCC receptors mediate attraction to 
netrins but can also participate in repulsion. 
UNC-5 receptors appear to function exclusively 
in repulsion, either alone or in combination with 
DCC receptors. UNC-5 receptors may require a
DCC coreceptor for repulsion farther away 
from the netrin source, where ligand concentra-
tion is likely to be lower (5, 10). This may
involve a direct interaction between the cyto-
plasmic domains of the two receptors (11).

Netrins guide many different axons in 
vivo. In some cases, netrin can exert its ef-
fects from distances of up to a few millime-
ters (12), but in others it appears to act only at
short range (9). Netrins have high affinity for
cell membranes (3, 4), and it is unclear how
far they can diffuse in vivo and how their 
diffusion is regulated. Indeed, a netrin gradi- 
ent has not yet been visualized directly in any 
system, and formal proof that netrin must 
diffuse away from its source to exert its 
long-range effects is lacking. 

Slits. Slits are large secreted proteins that 
signal through Roundabout (Robo) family re- 
ceptors. Robo was first identified in a genetic 
screen for midline guidance defects in Dro-
sophila (13, 14). Genetic studies suggested that
Robo is the receptor for a midline repellent
(14), subsequently identified as Slit (15, 16).
This repulsive action of Slit was found to be
conserved in vertebrates (17, 18). However, in a
parallel approach, Slit was also purified as a
factor that stimulates sensory axon branching
and elongation (19). Thus, Slits, like netrins, are
multifunctional. The importance of Slit-Robo 
signaling in axon guidance and cell migration is 
underscored by the recovery of robo mutations
in genetic screens for guidance defects in at 
least three different species (13, 20, 21) and the 
purification of Slit in at least three independent 
biochemical assays (19, 22, 23).

The best-understood functions of Slit pro- 
teins are in midline guidance in Drosophila and 
in the formation of the optic chiasm in verte- 
brates. In Drosophila, Slit is expressed at the 
ventral midline, where it acts as a short-range 
repellent signaling through Robo to prevent 
ipsilateral axons from crossing the midline and
commissural axons from recrossing (15, 16).
Two other Slit receptors, Robo2 and Robo3, 
specify the lateral positions of axons that run 
parallel to the midline, presumably in response 
to a long-range gradient of Slit activity diffus- 
ing away from the midline (24, 25). 

Vertebrate Slit proteins are also expressed by 
ventral midline cells (17), and commissural ax- 
ons are repelled by Slit after they have crossed 
the midline (26). Mice deficient for both Slit1
and Slit2 lack any obvious defects in midline
guidance in the spinal cord (27), but Slit3 is still
expressed at the midline in these mice. 

Slit1/2-deficient mice do have striking de-
fects in the formation of the optic chiasm,




Fig. 1. Conserved families of guidance molecules (A) and their receptors (B). Domain names are
from SMART (http://smart.embl-heidelberg.de). P1 to P3, DB (DCC-binding), CC0 to CC3, and SP1
and SP2 indicate conserved regions in the cytoplasmic domains of DCC, UNC-5, Robo, and Plexin
receptors, respectively.



where Slit3 is not expressed (27). These de- 
fects are strikingly reminiscent of those seen 
in astray/robo2 mutant fish, in which retinal
axons make multiple guidance errors before,
during, and after crossing the midline (21).
Similar errors also occur in wild-type fish but
are always corrected (28). In fish, all retinal
axons project contralaterally, but in mice,
which have binocular vision, some axons 
project contralaterally and others ipsilateral- 
ly. By analogy to the role of Drosophila Slit 
in midline guidance, it was anticipated that 
the vertebrate Slit proteins might be ex- 
pressed at the chiasm and control the choice 
of an ipsilateral or contralateral projection. 
This is not the case (29). Instead, Slit1 and
Slit2 are expressed by cells surrounding the 
chiasm and repel ipsilateral and contralateral 
axons alike (23, 27, 30, 31). This has led to 
the idea that Slits form a repulsive corridor to
guide all retinal axons through the chiasm. 

Semaphorins. Semaphorins are a large 
family of cell surface and secreted guidance 
molecules, defined by the presence of a con-
served ~420-amino acid Sema domain at
their NH2-termini. The first semaphorins 
were identified by searching for molecules 
expressed on specific axon fascicles in the 
grasshopper central nervous system (CNS)
(32) and by purifying a potent inducer of 
vertebrate sensory growth cone collapse in 
vitro (33). Semaphorins are divided into eight 
classes, on the basis of their structure. Classes 
1 and 2 are found in invertebrates, classes 3 to
7 are found in vertebrates, and class V sema- 
phorins are encoded by viruses (34). 

Semaphorins signal through multimeric re- 
ceptor complexes. The composition of these 
receptor complexes is not fully known. Many,
and perhaps all, semaphorin receptor complex- 
es include a plexin protein. Plexins comprise a 
large family of transmembrane proteins divided 
into four groups (A to D), on the basis of 
sequence similarity (35). Drosophila PlexinA is
a functional receptor for the transmembrane 
Sema1a (36), vertebrate plexin-As are func-
tional receptors for secreted class 3 semaphor-
ins (35, 37), and other plexins bind directly to
semaphorins of different classes (35, 38, 39). 
Receptor complexes for the vertebrate class 3 
semaphorins also include neuropilins, which 
bind directly to both semaphorins and plexins
(34). Neuropilins do not appear to have a sig- 
naling function, but rather contribute to ligand 
specificity. Other essential components of 
semaphorin receptor complexes include the 
neural cell adhesion molecule L1 (for Sema3A)
(40), the receptor tyrosine kinase Met (for
Sema4D) (41), and the catalytically inactive
receptor tyrosine kinase OTK (for Drosophila
Sema1a) (42).

Genetic analysis of semaphorin function 
in flies and in mice suggests that they primar- 
ily act as short-range inhibitory cues that 
deflect axons away from inappropriate re-
gions, or guide them through repulsive
corridors (34, 37). Evidence suggests that
semaphorins may also act as attractive cues
for certain axons (i^, 43), although this
remains to be verified by genetic analysis.
Interestingly, semaphorins do not seem to
function in axon guidance in C. elegans,
but instead have an analogous role in
discouraging inappropriate cell contacts.
Worms have three semaphorin and two
plexin genes, all of which
have been mutated (44-46). In these mutants,
epidermal cells that should only transiently 
contact one another instead make more per- 
durant contacts. 

Ephrins. In a classic paper (47), Sperry
postulated that vertebrate retinal axons are 
guided to their appropriate topographic loca- 
tions in the optic tectum by an orthogonal 
system of molecular gradients in the retina 
and the tectum. The search for these graded 
cues led to the identification of the ephrins, 
membrane-bound ligands for the Eph family 
of receptor tyrosine kinases (48, 49). Ephrins 
and Eph receptors fall into two classes: eph- 
rin-As, which are anchored to the membrane 
by a glycosylphosphatidylinositol (GPI) link- 
age and bind EphA receptors; and ephrin-Bs, 
which have a transmembrane domain and 
bind EphB receptors (50).

In the visual system, topographic mapping 
of retinal axons along the anterior-posterior 
axis depends on repulsion mediated by eph- 
rin-A ligands and their EphA receptors (50). 
Ephrin-A ligands are expressed in a gradient 
in the tectum [or its mammalian equivalent, 
the superior colliculus (SC)], and EphA re- 
ceptors are expressed in a complementary 
gradient in the retina. Retinal axons with 
successively higher EphA levels map to suc- 
cessively lower points along the ephrin-A 
gradient. If the ephrin-A gradient is eliminat-
ed in the mouse SC, then retinal axons do not
all shift to one end of the SC, as would be 
expected if each retinal axon simply mapped 
to a specific threshold value on the ephrin-A 
gradient. Instead, retinal axons still fill the 
entire SC, but their topographic order is dis-
rupted: some axons shift posteriorly and
others anteriorly (51). This suggests that the
ephrin-A gradient establishes the topographic 
order of retinal axons, but not their precise 
termination sites.

Fig. 2. (A and B) A model showing one way in which a growth cone might turn toward an attractant (green).

Further support for this
model comes from a clever genetic experi-
ment in which half the retinal axons were
forced to express higher levels of an EphA
receptor (52). Those axons with extra EphA
receptors shifted down the ephrin-A gradient, 
whereas those with only their endogenous 
levels shifted up the gradient. The result was 
two smooth maps, one in each half of the SC. 
The conclusion is that the mapping of retinal 
axons depends on their relative EphA levels, 
not their absolute levels. 
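
The distinction between relative and absolute EphA levels can be illustrated with a rank-based toy model. The sketch below is a hedged illustration only: the function name, the numeric levels, and the choice of a pure rank rule are assumptions made for the example, not a description of how the cited experiments were analyzed.

```python
def map_by_relative_epha(epha_levels):
    """Assign each axon a position on the ephrin-A gradient by rank.

    Returns one value per axon in [0, 1], where 1.0 is the high end of
    the ephrin-A gradient and 0.0 the low end. The axon with the highest
    EphA level is sent to the lowest ephrin-A position. Only the rank
    order of the levels matters, not their absolute values.
    """
    n = len(epha_levels)
    if n == 0:
        return []
    if n == 1:
        return [0.5]
    order = sorted(range(n), key=lambda i: epha_levels[i])
    positions = [0.0] * n
    for rank, axon in enumerate(order):
        positions[axon] = 1.0 - rank / (n - 1)
    return positions


# Hypothetical endogenous EphA levels rising across the retina.
endogenous = [1.0, 2.0, 3.0, 4.0]
baseline_map = map_by_relative_epha(endogenous)

# Raising every axon's level by the same amount leaves the map unchanged.
shifted_map = map_by_relative_epha([level + 10.0 for level in endogenous])

# Every second axon receives extra EphA (hypothetical increment of 10).
boosted_levels = [
    level + (10.0 if i % 2 else 0.0) for i, level in enumerate(endogenous)
]
mixed_map = map_by_relative_epha(boosted_levels)
```

In this toy model the axons with extra EphA all land in the low-ephrin half and the unmodified axons in the high-ephrin half, with retinal order preserved inside each group, which mirrors the two smooth maps described above.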

Mapping along the dorsal-ventral axis, in 
contrast, involves attractive signaling mediated 
by ephrin-B ligands and EphB receptors (53, 
54). Correct mapping of retinal axons along 
this axis evidently requires both "forward" sig-
naling, in which ephrin-B ligands activate
EphB receptors, and "reverse" signaling, in 
which EphBs serve as ligands to signal back 
through the transmembrane ephrin-Bs. 

Ephrins control axon guidance in many 
other places too, and the ability to signal in 
either direction is a common theme, as is the 
ability to mediate either attraction or repul-
sion (50). For example, ephrin-B reverse sig-
naling repels forebrain commissural axons
away from regions of EphB expression (55) 
while attracting them to regions of EphA4 
expression (56). The GPI-anchored ephrin- 
As are also able to signal in the reverse 
direction (57) and may act in this mode to 
mediate attraction or adhesion during map- 
ping of vomeronasal axons to the accessory 
olfactory bulb (58). 

Mammals have 13 Eph receptors and 8 
ephrins. Worms and flies both have just a 
single Eph receptor, with four and one ephrin 
ligands, respectively. Somewhat surprisingly, 
the invertebrate ephrin and Eph mutants do 
not have dramatic axon guidance defects (59-
63). The C. elegans ephrins and the Eph
receptor do, however, have critical functions 
in multiple aspects of epithelial morphogen- 
esis, as do their vertebrate counterparts (50). 
It seems that ephrins and Eph receptors are an 



ancient but versatile system for cell-cell com- 
munication that has diversified and acquired 
its axon guidance functions primarily during 
vertebrate evolution. 



Steering the Growth Cone 

Cytoskeleton. Growth cone turning is a com- 
plex process in which actin-based motility is 
harnessed to produce persistent and directed 
microtubule advance (Fig. 2). Actin filaments 
are organized into two distinct populations:
dense, parallel filaments that radiate outward 
and into filopodia; and intervening networks 
of loosely interwoven filaments (64). Filopo- 
dial filaments are oriented with their fast- 
growing barbed ends toward the filopodium 
tip. The extension and retraction of a filopo- 
dium reflect the balance between the poly- 
merization of actin at barbed ends and the 
retrograde flow of entire filaments (65-67). 
Filopodia often extend asymmetrically before 
the entire growth cone turns (68-70), and 
without filopodia, growth cones become dis- 
oriented (69, 71, 72). The precise role of 
filopodia in growth cone turning remains un- 
clear, but they have been postulated to steer 
the growth cone by differential adhesion (73),
generating mechanical force (74), or trans- 
ducing distal signals (75). 

Microtubules form stable, cross-linked 
bundles in the axon shaft. Single microtubule 
filaments also emerge into the growth cone. 
These filaments display the classic properties 
of dynamic instability, extending and retract- 
ing as they explore the peripheral region of 
the growth cone (76). These dynamic micro-
tubules grow preferentially along the filopo-
dial actin filaments (76, 77), and the capture
or stabilization of microtubule bundles in a 
specific filopodium may be a critical event in 
growth cone turning. Consistent with this 
view, stabilization and dilation of a single 
filopodium appear to be a common feature of 
growth cone turning in vivo (78-80).

There are many different ways in which a
guidance signal might intervene to steer the
growth cone. For example, a guidance cue
might promote the initiation, extension, sta-
bilization, or retraction of individual filopo-
dia, or the capture or stabilization of micro-
tubules in specific regions of the growth
cone. Likely targets for the signaling path- 
ways downstream of guidance receptors are 
therefore molecules such as Arp2/3 (to nucle-
ate new actin filaments), Ena/VASP proteins
(to promote filament elongation), adhesion
molecules (to couple actin filaments to the
substrate), and myosins (to regulate the ret-
rograde flow of actin filaments). Molecules
that capture microtubule ends (e.g., IQGAP1)
or suppress microtubule instability (e.g.,
MAP1B) are also potential targets for guid-
ance signals. We still need to determine
which aspect(s) of actin or microtubule dy-
namics are the primary targets for regulation
by each of the known guidance cues. It is
difficult to trace the signaling pathways 
downstream of a guidance receptor without 
knowing what lies at the "business end." 

Signaling. With their well-known roles in 
regulating cytoskeletal dynamics in fibroblasts, 
Rho guanosine triphosphatases (GTPases) were 
strong candidates to transduce guidance signals 
in the growth cone. A function for Rho 
GTPases in growth cone guidance was suggest-
ed from studies with dominant mutant isoforms
(81) and was confirmed by the analysis of
loss-of-function mutations in flies and worms
(82-85). Biochemical links have also been
made between several guidance receptors and
Rho GTPases. For example, EphA receptors
regulate the guanine nucleotide exchange factor
(GEF) Ephexin (86); Robo receptors may act at
least in part by regulating GTPase-activating
proteins (GAPs) (87); and Plexins bind directly
to Rho GTPases (88) and Rho GEFs (39), and
may even have intrinsic GAP activity (89).
Several downstream effectors of Rho GTPases
have also been implicated in axon growth and
guidance, such as Pak (90) and Rho kinase (91).

Genetic studies have also revealed impor-
tant roles for Ena/VASP proteins in axon guid-
ance (92-94). These proteins antagonize
capping proteins to promote actin filament
elongation (95). In motile fibroblasts, Ena/
VASP proteins localize to the leading edge of
lamellipodia. Depletion of Ena/VASP proteins
from the leading edge leads to shorter, more
highly branched filaments that generate greater
protrusive force and increased motility. Con-
versely, increasing Ena/VASP levels at the
leading edge results in longer, unbranched fil-
aments and reduced motility (95, 96). Genetic
studies implicating Ena/VASP proteins in re-
pulsive growth cone guidance by both Slit (94)
and netrin (92) have been interpreted in light of
this negative role in fibroblast motility. How-
ever, in growth cones, Ena/VASP proteins lo-
calize to filopodial tips (97), where actin fila-
ments are normally unbranched and stable.
Here, their activity would be expected to pro- 
mote filopodial extension, making a role in 
attractive guidance equally plausible. 

These considerations raise an important 
point. Migrating fibroblasts and axonal 
growth cones can have very different cy- 
toskeletal organizations, and the location and 
action of molecules such as Ena/VASP pro-
teins and Rho GTPases in growth cones
cannot be inferred merely by analogy to fi- 
broblasts. It will be important to determine 
precisely when, where, and how these pro- 
teins function in growth cones. 

Calcium signaling may also play an im-
portant role in growth cone turning. In cul-
tured Xenopus spinal neurons, turning in re-
sponse to a netrin-1 gradient requires calcium
influx through the plasma membrane, as well
as calcium release from intracellular stores
(98). Moreover, netrin-1 induces a transient
Ca2+ gradient in the growth cone (98), and
the creation of such a gradient by local pho-
tolysis of caged Ca2+ or release from intra-
cellular stores is sufficient to induce turning
in the absence of netrin-1 (98, 99). Sponta-
neous calcium transients have also been ob-
served in growth cones (100) and in filopodia
(101). The frequencies of these transients
appear to correlate negatively with growth
cone extension rates, but compelling evi-
dence of their involvement in growth cone
turning in vivo is lacking. 



Plasticity of Guidance Responses 

Axons can evidently differ in their response 
to the same cue, as they must if they are to 
follow divergent pathways. But even a sin- 
gle growth cone may need to respond to the 
same cue in different ways at different 
points along its journey. This is particularly 
true if the growth cone is to navigate 
through a series of intermediate targets be- 
fore reaching its final goal, as many do. 
Specifying an axon's trajectory is therefore 
not just a simple matter of selecting the 
appropriate set of guidance receptors and 
delivering them to the growth cone. The 
growth cone must also be able to modulate
its responsiveness en route. Some of the 
mechanisms underlying this plasticity have 
recently come to light. 

Modulation by cyclic nucleotides. In 
vitro, the responses of Xenopus spinal ax-
ons can be modulated by changing the lev-
els of cyclic nucleotides (102-104). Re-
sponses to some guidance cues, including
netrin-1, are sensitive to levels of cAMP or
protein kinase A (PKA) activity, while oth-
ers, including Sema3A, are modulated by
cGMP and protein kinase G (PKG). The
general finding is that lowering cAMP or
cGMP levels, or inhibiting PKA or PKG,
converts an attractive response to a repul- 
sive one, whereas elevating cAMP or 



cGMP, or activating PKA or PKG, switches 
repulsion to attraction. 
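
The general finding stated above amounts to a simple switching rule, which can be written out as a short Python sketch. This is a deliberately reduced illustration: the names are hypothetical, and collapsing a graded pharmacological manipulation into the sign of a single number is an assumption made only for the example.

```python
from enum import Enum


class Response(Enum):
    ATTRACTION = "attraction"
    REPULSION = "repulsion"


def modulated_response(baseline: Response, nucleotide_shift: float) -> Response:
    """Turning response after a change in cyclic nucleotide signaling.

    nucleotide_shift < 0 stands for lowering cAMP/cGMP or inhibiting
    PKA/PKG; nucleotide_shift > 0 stands for elevating cAMP/cGMP or
    activating PKA/PKG (assumption: only the sign is modeled).
    """
    if nucleotide_shift < 0:
        # Lowered signaling converts attraction to repulsion.
        return Response.REPULSION
    if nucleotide_shift > 0:
        # Elevated signaling switches repulsion to attraction.
        return Response.ATTRACTION
    return baseline


lowered = modulated_response(Response.ATTRACTION, -1.0)
elevated = modulated_response(Response.REPULSION, +1.0)
unchanged = modulated_response(Response.ATTRACTION, 0.0)
```

Which nucleotide applies depends on the cue: per the text, netrin-1 responses track cAMP and PKA, whereas Sema3A responses track cGMP and PKG.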

Modulation of netrin-1 responsiveness by
cAMP levels may play an important role in
pathfinding of Xenopus retinal axons to the
tectum (105). These axons are first attracted
out of the eye by netrin-1 at the optic nerve
head, become indifferent to it as they then 
grow through the ventral diencephalon, and 
finally are repelled by netrin-1 once they 
reach the tectum. These changes correlate 
with a gradual decline in cAMP levels and 
can be reversed by artificially raising cAMP 
levels. An intriguing variation on this theme 
has been documented in the mammalian cor-
tex (106). Sema3A attracts the apical den-
drites of pyramidal neurons toward the corti-
cal plate but repels their axons away from it.
Interestingly, a guanylyl cyclase is specifical-
ly localized in dendrites, implying that cGMP
levels may be higher in dendrites than in 
axons. 

Local translation in the growth cone. Ap- 
plying netrin-1 or Sema3A to cultured Xeno- 
pus retinal axons induces local protein syn- 
thesis within the growth cone, and blocking 
translation inhibits the turning but not the 
growth of these axons (107). Induced protein
synthesis is rapid enough to contribute direct-
ly to growth cone steering, but work on Xe-
nopus spinal axons suggests a more subtle
role: Growth cones might need to synthesize
new proteins to maintain their sensitivity as
they migrate up or down a ligand gradient
(108). Spinal growth cones undergo consec-
utive phases of desensitization and resensiti-
zation to netrin-1 in vitro, and resensitization
requires protein synthesis. Inhibiting transla- 
tion in spinal axons does not block turning 
toward the netrin-1 source, as it does in ret-
inal axons (107), but actually causes turning
away from it (108). This is difficult to explain
if translation has a direct role in growth cone
turning, but could be explained by a role in
resensitization: If desensitization is more rap-
id on the side of the growth cone facing
toward the source, then a failure to synthesize
the new proteins needed for resensitization
could result, paradoxically, in a stronger at- 
tractive signal on the side facing away from 
the source. 
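
The paradox described above can be checked with a toy calculation. The sketch below assumes, purely for illustration, that the signal on each side of the growth cone is ligand concentration multiplied by receptor sensitivity, and that sensitivity decays exponentially with exposure when resensitization is blocked. The exponential form, the parameter values, and all names are hypothetical and do not come from the cited experiments.

```python
import math


def side_signals(c_near, c_far, k, t, resensitize):
    """Return (near, far) signal strengths on the two sides of a growth cone.

    c_near, c_far: ligand concentration on the side facing toward and
    away from the source. k: desensitization rate constant. t: elapsed
    time. With resensitization, sensitivity is held at 1.0; without it,
    sensitivity decays as exp(-k * concentration * t).
    """
    def sensitivity(conc):
        return 1.0 if resensitize else math.exp(-k * conc * t)

    return c_near * sensitivity(c_near), c_far * sensitivity(c_far)


def turning_direction(c_near, c_far, k, t, resensitize):
    """Turn toward whichever side carries the stronger attractive signal."""
    near, far = side_signals(c_near, c_far, k, t, resensitize)
    if near > far:
        return "toward"
    if far > near:
        return "away"
    return "none"


# Hypothetical gradient: twice as much ligand on the near side.
with_synthesis = turning_direction(2.0, 1.0, k=1.0, t=2.0, resensitize=True)
without_synthesis = turning_direction(2.0, 1.0, k=1.0, t=2.0, resensitize=False)
```

Under these assumptions the near side desensitizes faster, so blocking resensitization leaves the stronger attractive signal on the far side and the model growth cone turns away from the source.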

Local translation might also be used to com- 
pletely switch the growth cone's responsiveness 
to specific cues once it reaches an intermediate 
target. Evidence for such a mechanism comes
from the finding that the 3'-untranslated region
of the EphA2 mRNA contains a sequence that 
confers selective translation in the distal seg- 
ments of commissural axons, after they have 
crossed the midline (109). This could explain
why the EphA2 receptor is only expressed at 
high levels in the segments of these axons 
that extend beyond the midline. The implica- 
tion is that commissural axons might become 
sensitive to the ephrin-A ligands in the spinal
cord only after crossing, although this re-
mains to be tested.

Switching responses at the midline. To 
reach their targets on the contralateral side 
of the CNS, commissural axons must first 
grow toward the midline, but then leave it 
again on the opposite side and never turn 
back. Experiments in rodents, 
chicks, and flies have suggested 
a simple model for this behav- 
ior, in which commissural 
growth cones switch their sensi- 
tivity to midline attractants and 
repellents as they cross (Fig. 3). 
Before crossing, commissural 
axons are attracted to the mid- 
line by netrin (i. 4) but are 
insensitive to the midline repel- 
lents Slit and, in vertebrates, 
certain class 3 semaphorins (77, 
26). After crossing, these axons 
are insensitive to netrins (at 
least in the vertebrate hindbrain) 
(110) but are repelled by both 
Slits and semaphorins (26). 
What turns attraction off and 
repulsion on at the midline? 

One way in which netrin at- 
traction could be turned off is by 
exposure to Slit. This is suggest- 
ed by studies on cultured Xeno- 
pus spinal neurons (111). Young 
spinal axons in vitro, like pre- 
crossing commissural axons in 
vivo, are attracted by netrin and 
are unresponsive to Slit. How- 
ever, when both cues are applied 
simultaneously, netrin can still 
stimulate axon growth but not 
turning. This is not just a simple 
matter of repulsion canceling 
out attraction, because these ax- 
ons are not repelled by Slit at 
all, and other attractive respons- 
es are not affected. Thus, Slit 
specifically silences attraction 
by netrin. This silencing effect 
is mediated by a direct interac- 
tion between the cytoplasmic 
domains of the Robo and DCC 
receptors (111). 

This could explain how attraction by ne- 
trin is shut down at the midline, but what 
turns Slit repulsion on? In flies, Robo recep- 
tors are expressed at high levels on commis- 
sural axons only after crossing, even though 
robo mRNA is expressed early on (14). 
Robo protein is also synthesized before 
crossing, but an intracellular sorting recep- 
tor (Comm) apparently prevents it from 
being delivered to the growth cone, target- 
ing the newly synthesized Robo instead for 
lysosomal degradation (112, 113). Once a 
commissural axon has crossed the midline 



by both transcriptional and posttranscrip- 
tional mechanisms. This allows Robo to be 
delivered to the growth cone, thereby con- 
ferring sensitivity to Slit. Thus, it is not the 
local synthesis or activity of the Robo re- 
ceptor that is regulated in Drosophila com- 
missural axons, but rather its intracellular 



[Fig. 3 panel labels: "Attracted by netrin, insensitive to Slit and semaphorins"; "comm ON, Robo sorted to lysosomes"; "comm OFF, Robo delivered to growth cone"; "Repelled by Slit and semaphorins, insensitive to netrin".] 



Fig. 3. Switching sensitivity at the midline. As they cross the floor plate, 
vertebrate commissural axons lose sensitivity to the midline attractant 
netrin and acquire sensitivity to Slit and semaphorin repellents. This 
switch may be mediated in part by silencing of netrin attraction by Slit. 
Drosophila commissural axons also become sensitive to Slit only after 
crossing. This appears to reflect Comm's role in regulating the intracel- 
lular trafficking of Robo. 



to control a wide range of guidance decisions 
in vivo. How can so few molecules contribute 
so much to the correct wiring of the nervous 
system? Two related principles emerging 
from these studies seem to be important. 
First, guidance cues are multifunctional. A 
single cue can either attract or repel axons, at 
short or long range, and may even 
elicit other responses such as 
branching or an altered sensitivity 
to other cues. Second, growth 
cone responses are remarkably 
plastic, subject to modulation by 
both intrinsic and extrinsic fac- 
tors. Together, these mechanisms 
may underlie much of the diver- 
sity in growth cone behavior. 

What are the major challeng- 
es that still lie ahead? One will 
be to identify more guidance 
factors, in particular those that 
may have more specialized 
functions, and to figure out how 
they work. Another challenge 
will be to gain a better picture of 
how guidance cues steer growth 
cones. We now have a few tan- 
talizing glimpses, but are still a 
long way from a coherent view 
of growth cone turning. Also, 
having learned that the outcome 
of a particular signaling event is 
essentially unpredictable, the 
need is now greater than ever to 
push ahead with the analysis of 
guidance mechanisms in vivo. 
We need to know, for example, 
how the distributions of the var- 
ious guidance molecules are 
controlled in space and time, 
and how each growth cone 
knows when and how to respond 
to these cues. The ultimate chal- 
lenge, after all, is to find out 
how a comparatively small 
number of guidance molecules 
generate such astonishingly 
complex patterns of neuronal 
wiring. 



trafficking. This mechanism may also ap- 
ply to Robo2 and Robo3. These two Slit 
receptors are also down-regulated during 
midline crossing, but must be up-regulated 
after crossing for axons to select their ap- 
propriate pathways on the contralateral side 
(24, 25). 



Comm appears to be inactivated, possibly 



Concluding Remarks 

Netrins, Slits, semaphorins, and ephrins are 
not the only guidance cues we know of, and 
many more undoubtedly still await discovery. 
Nevertheless, members of these four families 
have turned up repeatedly in various genetic 
and biochemical assays and have been found 
[...] the human genome was generated by the whole-genome shotgun sequencing 
method. The 14.8-billion bp DNA sequence was generated over 9 months from 
27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) 
from both ends of plasmid clones made [...]. Two 
assembly strategies, a whole-genome assembly and a regional chromosome 
assembly, were used, each [...] 
publicly funded [...] into 550-bp 
segments to create a 2.9-fold coverage of [...] 
sequenced, without including biases inherent in the cloning [...] 
procedure used [...]. [...] cov- 
erage in the assemblies [...] of gaps in 
the final assembly over what would be obtained with 5.11-fold coverage. The 
two assembly strategies [...] 
independent mapping data. The assemblies effectively cover the euchromatic 
regions of the human chromosomes. More than 90% of the genome is in 
scaffold assemblies of 100,000 bp or more, and 25% of the genome is in 
scaffolds of 10 million bp or larger. [...] the genome sequence revealed 
26,588 protein-encoding [...] strong corroborating 
evidence and an additional [...] 
matches or other weak supporting evidence. Although gene-dense clusters are 
obvious, almost half the genes are [...] sequence separated 
by large tracts of apparently noncoding sequence. [...] of the genome 
is spanned by exons, whereas 24% is in introns, with 75% of the genome being 
intergenic DNA. Duplications [...] to chro- 
mosomal lengths, are abundant throughout the genome and reveal a complex 
evolutionary history. Comparative genomic analysis [...] vertebrate ex- 
pansions of genes associated with neuronal function, [...] tissue-specific de- 
velopmental regulation, [...]. [...] 
sequence comparisons between the consensus [...] 
genome data provided [...] differed at a rate of 1 bp per 
[...] 
remains an open challenge. 



Decoding of the DNA that constitutes the 
human genome has been widely anticipated 
for the contribution it will make toward un- 
derstanding human evolution, the causation 
of disease, and the interplay between the 
environment and heredity in defining the hu- 
man condition. A project with the goal of 
determining the complete nucleotide se- 
quence of the human genome was first for- 
mally proposed in 1985 (1). In subsequent 
years, the idea met with mixed reactions in 
the scientific community (2). However, in 
1990, the Human Genome Project (HGP) was 
officially initiated in the United States under 
the direction of the National Institutes of 
Health and the U.S. Department of Energy 
with a 15-year, $3 billion plan for completing 
the genome sequence. In 1998 we announced 
our intention to build a unique genome- 
sequencing facility, to determine the se- 
quence of the human genome over a 3-year 
period. Here we report the penultimate mile- 
stone along the path toward that goal, a nearly 
complete sequence of the euchromatic por- 
tion of the human genome. The sequencing 
was performed by a whole-genome random 
shotgun method with subsequent assembly of 

[...] history of DNA sequencing 
began in 1977, when Sanger reported his meth- 
od for determining the order of nucleotides of 



DNA using chain-terminating nucleotide ana- 
logs (3). In the same year, the first human gene 
was isolated and sequenced (4). In 1986, Hood 
and co-workers (5) described an improvement 
in the Sanger sequencing method that included 
attaching fluorescent dyes to the nucleotides, 
which permitted them to be sequentially read 
by a computer. The first automated DNA se- 
quencer, developed by Applied Biosystems in 
California in 1987, was shown to be successful 
when the sequences of two genes were obtained 
with this new technology (6). From early se- 
quencing of human genomic regions (7), it 
became clear that cDNA sequences (which are 
reverse-transcribed from RNA) would be es- 
sential to annotate and validate gene predictions 
in the human genome. These studies were the 
basis in part for the development of the ex- 
pressed sequence tag (EST) method of gene 
identification (8), which is a random selection, 
very high throughput sequencing approach to 
characterize cDNA libraries. The EST method 
led to the rapid discovery and mapping of hu- 
man genes (9). The increasing numbers of hu- 
man EST sequences necessitated the develop- 
ment of new computer algorithms to analyze 
large amounts of sequence data, and in 1993 at 
The Institute for Genomic Research (TIGR), an 
algorithm was developed that permitted assem- 
bly and analysis of hundreds of thousands of 
ESTs. This algorithm permitted characteriza- 
tion and annotation of human genes on the basis 
of 30,000 EST assemblies (10). 
The complete 49-kbp bacteriophage lamb- 
da genome sequence was determined by a 
shotgun restriction digest method in 1982 
(11). When considering methods for sequenc- 
ing the smallpox virus genome in 1991 (12), 
a whole-genome shotgun sequencing method 
was discussed and subsequently rejected ow- 
ing to the lack of appropriate software tools 
for genome assembly. However, in 1994, 
when a microbial genome-sequencing project 
was contemplated at TIGR, a whole-genome 
shotgun sequencing approach was considered 
possible with the TIGR EST assembly algo- 
rithm. In 1995, the 1.8-Mbp Haemophilus 
influenzae genome was completed by a 
whole-genome shotgun sequencing method 
(13). The experience with several subsequent 
genome-sequencing efforts established the 
broad applicability of this approach (14, 15). 
A key feature of the sequencing approach 
used for these megabase-size and larger ge- 
nomes was the use of paired-end sequences 
(also called mate pairs), derived from sub- 
clone libraries with distinct insert sizes and 
cloning characteristics. Paired-end sequences 
are sequences 500 to 600 bp in length from 
both ends of double-stranded DNA clones of 
prescribed lengths. The success of using end 
sequences from long segments (18 to 20 kbp) 
of DNA cloned into bacteriophage lambda in 
assembly of the microbial genomes led to the 
suggestion (16) of an approach to simulta- 



www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 





neously map and sequence the human ge- 
nome by means of end sequences from 150- 
kbp bacterial artificial chromosomes (BACs) 
(17, 18). The end sequences spanned by 
known distances provide long-range continu- 
ity across the genome. A modification of the 
BAC end-sequencing (BES) method was ap- 
plied successfully to complete chromosome 2 
from the Arabidopsis thaliana genome (19). 

In 1997, Weber and Myers (20) proposed 
whole-genome shotgun sequencing of the 
human genome. Their proposal was not well 
received (21). However, by early 1998, as 
less than 5% of the genome had been se- 
quenced, it was clear that the rate of progress 
in human genome sequencing worldwide 
was very slow (22), and the prospects for 
finishing the genome by the 2005 goal were 
uncertain. 

In early 1998, PE Biosystems (now Applied 
Biosystems) developed an automated, high- 
throughput capillary DNA sequencer, subse- 
quently called the ABI PRISM 3700 DNA 
Analyzer. Discussions between PE Biosystems 
and TIGR scientists resulted in a plan to under- 
take the sequencing of the human genome with 
the 3700 DNA Analyzer and the whole-genome 
shotgun sequencing techniques developed at 
TIGR (23). Many of the principles of operation 
of a genome-sequencing facility were estab- 
lished in the TIGR facility (24). However, the 
facility envisioned for Celera would have a 
capacity roughly 50 times that of TIGR, and 
thus new developments were required for sam- 
ple preparation and tracking and for whole- 
genome assembly. Some argued that the re- 
quired 150-fold scale-up from the H. influenzae 
genome to the human genome with its complex 
repeat sequences was not feasible (25). The 
Drosophila melanogaster genome was thus 
chosen as a test case for whole-genome assem- 
bly on a large and complex eukaryotic genome. 
In collaboration with Gerald Rubin and the 
Berkeley Drosophila Genome Project, the nu- 
cleotide sequence of the 120-Mbp euchromatic 
portion of the Drosophila genome was deter- 
mined over a 1-year period (26-28). The Dro- 
sophila genome-sequencing effort resulted in 
two key findings: (i) that the assembly algo- 
rithms could generate chromosome assemblies 
with highly accurate order and orientation with 
substantially less than 10-fold coverage, and (ii) 
that undertaking multiple interim assemblies in 
place of one comprehensive final assembly was 
not of value. 

These findings, together with the dramatic 
changes in the public genome effort subsequent 
to the formation of Celera (29), led to a modi- 
fied whole-genome shotgun sequencing ap- 
proach to the human genome. We initially pro- 
posed to do 10-fold sequence coverage of the 
genome over a 3-year period and to make in- 
terim assembled sequence data available quar- 
terly. The modifications included a plan to per- 
form random shotgun sequencing to ~5-fold 
coverage and to use the unordered and unori- 
ented BAC sequence fragments and subassem- 
blies published in GenBank by the publicly 
funded genome effort (30) to accelerate the 
project. We also abandoned the quarterly an- 
nouncements in the absence of interim assem- 
blies to report. 

Although this strategy provided a reason- 
able result very early that was consistent with a 
whole-genome shotgun assembly with eight- 
fold coverage, the human genome sequence is 
not as finished as the Drosophila genome was 
with an effective 13-fold coverage. However, it 
became clear that even with this reduced cov- 
erage strategy, Celera could generate an accu- 
rately ordered and oriented scaffold sequence of 
the human genome in less than 1 year. Human 
genome sequencing was initiated 8 September 
1999 and completed 17 June 2000. The first 
assembly was completed 25 June 2000, and the 
assembly reported here was completed 1 Octo- 
ber 2000. Here we describe the whole-genome 
random shotgun sequencing effort applied to 
the human genome. We developed two differ- 
ent assembly approaches for assembling the ~3 
billion bp that make up the 23 pairs of chromo- 
somes of the Homo sapiens genome. Any Gen- 
Bank-derived data were shredded to remove 
potential bias to the final sequence from chi- 
meric clones, foreign DNA contamination, or 
misassembled contigs. Insofar as a correctly 
and accurately assembled genome sequence 
with faithful order and orientation of contigs 
is essential for an accurate analysis of the 
human genetic code, we have devoted a con- 
siderable portion of this manuscript to the 
documentation of the quality of our recon- 
struction of the genome. We also describe our 
preliminary analysis of the human genetic 
code on the basis of computational methods. 
Figure 1 (see fold-out chart associated with 
this issue; files for each chromosome can be 
found in Web fig. 1 on Science Online at 
www.sciencemag.org/cgi/content/full/291/ 
5507/1304/DC1) provides a graphical over- 
view of the genome and the features encoded 
in it. The detailed manual curation and inter- 
pretation of the genome are just beginning. 

To aid the reader in locating specific an- 
alytical sections, we have divided the paper 
into seven broad sections. A summary of the 
major results appears at the beginning of each 
section. 

1 Sources of DNA and Sequencing Methods 

2 Genome Assembly Strategy and 
Characterization 

3 Gene Prediction and Annotation 

4 Genome Structure 

5 Genome Evolution 

6 A Genome-Wide Examination of 
Sequence Variations 

7 An Overview of the Predicted Protein- 
Coding Genes in the Human Genome 

8 Conclusions 



1 Sources of DNA and Sequencing 
Methods 

Summary. This section discusses the rationale 
and ethical rules governing donor selection to 
ensure ethnic and gender diversity along with 
the methodologies for DNA extraction and li- 
brary construction. The plasmid library con- 
struction is the first critical step in shotgun 
sequencing. If the DNA libraries are not uni- 
form in size and nonchimeric, and do not randomly 
represent the genome, then the subsequent steps 
cannot accurately reconstruct the genome se- 
quence. We used automated high-throughput 
DNA sequencing and the computational infra- 
structure to enable efficient tracking of enor- 
mous amounts of sequence information (27.3 
million sequence reads; 14.9 billion bp of se- 
quence). Sequencing and tracking from both 
ends of plasmid clones from 2-, 10-, and 50-kbp 
libraries were essential to the computational 
reconstruction of the genome. Our evidence 
indicates that the accurate pairing rate of end 
sequences was greater than 98%. 

Various policies of the United States and the 
World Medical Association, specifically the 
Declaration of Helsinki, offer recommenda- 
tions for conducting experiments with human 
subjects. We convened an Institutional Re- 
view Board (IRB) (31) that helped us estab- 
lish the protocol for obtaining and using hu- 
man DNA and the informed consent process 
used to enroll research volunteers for the 
DNA-sequencing studies reported here. We 
adopted several steps and procedures to pro- 
tect the privacy rights and confidentiality of 
the research subjects (donors). These includ- 
ed a two-stage consent process, a secure ran- 
dom alphanumeric coding system for speci- 
mens and records, circumscribed contact with 
the subjects by researchers, and options for 
off-site contact of donors. In addition, Celera 
applied for and received a Certificate of Con- 
fidentiality from the Department of Health 
and Human Services. This Certificate autho- 
rized Celera to protect the privacy of the 
individuals who volunteered to be donors as 
provided in Section 301(d) of the Public 
Health Service Act, 42 U.S.C. 241(d). 

Celera and the IRB believed that the ini- 
tial version of a completed human genome 
should be a composite derived from multiple 
donors of diverse ethnic backgrounds. Pro- 
spective donors were asked, on a voluntary 
basis, to self-designate an ethnogeographic 
category (e.g., African-American, Chinese, 
Hispanic, Caucasian, etc.). We enrolled 21 
donors (32). 

Three basic items of information from 
each donor were recorded and linked by con- 
fidential code to the donated sample: age, 
sex, and self-designated ethnogeographic 
group. From females, ~130 ml of whole 
heparinized blood was collected. From males, 
~130 ml of whole heparinized blood was 
collected, as well as five specimens of semen, 
collected over a 6-week period. Permanent 
lymphoblastoid cell lines were created by 
Epstein-Barr virus immortalization. DNA 
from five subjects was selected for genomic 
DNA sequencing: two males and three fe- 
males: one African-American, one Asian- 
Chinese, one Hispanic-Mexican, and two 
Caucasians (see Web fig. 2 on Science Online 
at www.sciencemag.org/cgi/content/291/5507/ 
1304/DC1). The decision of whose DNA to 
sequence was based on a complex mix of fac- 
tors, including the goal of achieving diversity as 
well as technical issues such as the quality of 
the DNA libraries and availability of immortal- 
ized cell lines. 

1.1 Library construction and 
sequencing 

Central to the whole-genome shotgun sequenc- 
ing process is preparation of high-quality plas- 
mid libraries in a variety of insert sizes so that 
pairs of sequence reads (mates) are obtained, 
one read from each end of each plasmid insert. 
High-quality libraries have an equal representa- 
tion of all parts of the genome, a small number 
of clones without inserts, and no contamination 
from such sources as the mitochondrial genome 
and Escherichia coli genomic DNA. DNA from 
each donor was used to construct plasmid librar- 
ies in one or more of three size classes: 2 kbp, 10 
kbp, and 50 kbp (Table 1) (33). 

In designing the DNA-sequencing pro- 
cess, we focused on developing a simple 
system that could be implemented in a robust 
and reproducible manner and monitored ef- 
fectively (Fig. 2) (34). 

Current sequencing protocols are based on 
the dideoxy sequencing method (35), which 
typically yields only 500 to 750 bp of sequence 
per reaction. This limitation on read length has 
made monumental gains in throughput a pre- 
requisite for the analysis of large eukaryotic 
genomes. We accomplished this at the Celera 
facility, which occupies about 30,000 square 
feet of laboratory space and produces sequence 
data continuously at a rate of 175,000 total 
reads per day. The DNA-sequencing facility is 
supported by a high-performance computation- 
al facility (36). 
The process for DNA sequencing was mod- 
ular by design and automated. Intermodule 
sample backlogs allowed four principal 
modules to operate independently: (i) li- 
brary transformation, plating, and colony 
picking; (ii) DNA template preparation; 
(iii) dideoxy sequencing reaction set-up 
and purification; and (iv) sequence deter- 
mination with the ABI PRISM 3700 DNA 
Analyzer. Because the inputs and outputs 
of each module have been carefully 
matched and sample backlogs are continu- 
ously managed, sequencing has proceeded 
without a single day's interruption since the 
initiation of the Drosophila project in May 
1999. The ABI 3700 is a fully automated 
capillary array sequencer and as such can 
be operated with a minimal amount of 
hands-on time, currently estimated at about 
15 min per day. The capillary system also 
facilitates correct associations of sequenc- 
ing traces with samples through the elimi- 
nation of manual sample loading and lane- 
tracking errors associated with slab gels. 
About 65 production staff were hired and 
trained, and were rotated on a regular basis 
through the four production modules. A 
central laboratory information management 
system (LIMS) tracked all sample plates by 
unique bar code identifiers. The facility was 
supported by a quality control team that per- 
formed raw material and in-process testing 
and a quality assurance group with responsi- 
bilities including document control, valida- 
tion, and auditing of the facility. Critical to 
the success of the scale-up was the validation 
of all software and instrumentation before 
implementation, and production-scale testing 
of any process changes. 

1.2 Trace processing 

An automated trace-processing pipeline has 
been developed to process each sequence file 
(37). After quality and vector trimming, the 
average trimmed sequence length was 543 
bp, and the sequencing accuracy was expo- 
nentially distributed with a mean of 99.5% 
and with less than 1 in 1000 reads being less 
than 98% accurate (26). Each trimmed se- 
quence was screened for matches to contam- 
inants including sequences of vector alone, E. 
coli genomic DNA, and human mitochondri- 
al DNA. The entire read for any sequence 
with a significant match to a contaminant was 
discarded. A total of 713 reads matched E. 
coli genomic DNA and 2114 reads matched 
the human mitochondrial genome. 

1.3 Quality assessment and control 

The importance of the base-pair level ac- 
curacy of the sequence data increases as the 
size and repetitive nature of the genome to 
be sequenced increases. Each sequence 
read must be placed uniquely in the ge- 
nome, and even a modest error rate can 
reduce the effectiveness of assembly. In 
addition, maintaining the validity of mate- 
pair information is absolutely critical for 
the algorithms described below. Procedural 
controls were established for maintaining 
the validity of sequence mate-pairs as se- 
quencing reactions proceeded through the 
process, including strict rules built into the 
LIMS. The accuracy of sequence data pro- 
duced by the Celera process was validated 
in the course of the Drosophila genome 
project (25). By collecting data for the 
entire human genome in a single facility, 
we were able to ensure uniform quality 
standards and the cost advantages associat- 
ed with automation, an economy of scale, 
and process consistency. 

Table 1. Celera-generated data input into assembly. 

                                      Number of reads for different insert libraries 
                           Individual        2 kbp       10 kbp       50 kbp        Total 
No. of sequencing reads    A                     0            0    2,767,357    2,767,357 
                           B            11,736,757    7,467,755       66,930   19,271,442 
                           C               853,819      881,290            0    1,735,109 
                           D               952,523    1,046,815            0    1,999,338 
                           F                     0    1,498,607            0    1,498,607 
                           Total        13,543,099   10,894,467    2,834,287   27,271,853 
Fold sequence coverage     A                     0            0         0.52         0.52 
(2.9-Gb genome)            B                  2.20         1.40         0.01         3.61 
                           C                  0.16         0.17            0         0.32 
                           D                  0.18         0.20            0         0.37 
                           F                     0         0.28            0         0.28 
                           Total              2.54         2.04         0.53         5.11 
Fold clone coverage        A                     0            0        18.39        18.39 
                           B                  2.96        11.26         0.44        14.67 
                           C                  0.22         1.33            0         1.54 
                           D                  0.24         1.58            0         1.82 
                           F                     0         2.26            0         2.26 
                           Total              3.42        16.43        18.84        38.68 
Insert size* (mean)        Average        1,951 bp    10,800 bp    50,715 bp 
Insert size* (SD)          Average           6.10%        8.10%       14.90% 
% Mates†                   Average           74.50        80.80        75.60 
Total number of base pairs [...] 

*Insert size and SD are calculated from assembly of mates on contigs. 
†% Mates is based on laboratory tracking of sequencing [...] 

2 Genome Assembly Strategy and 
Characterization 

Summary. We describe in this section the two 
approaches that we used to assemble the ge- 
nome. One method involves the computational 
combination of all sequence reads with shred- 
ded data from GenBank to generate an indepen- 
dent, nonbiased view of the genome. The sec- 
ond approach involves clustering all of the frag- 
ments to a region or chromosome on the basis 
of mapping information. The clustered data 
were then shredded and subjected to computa- 
tional assembly. Both approaches provided es- 
sentially the same reconstruction of assembled 
DNA sequence with proper order and orienta- 
tion. The second method provided slightly 
greater sequence coverage (fewer gaps) and 
was the principal sequence used for the analysis 
phase. In addition, we document the complete- 
ness and correctness of this assembly process 
and provide a comparison to the public genome 
sequence, which was reconstructed largely by 
an independent BAC-by-BAC approach. Our 
assemblies effectively covered the euchromatic 
regions of the human chromosomes. More than 
90% of the genome was in scaffold assemblies 
of 100,000 bp or greater, and 25% of the ge- 
nome was in scaffolds of 10 million bp or 
larger. 

Shotgun sequence assembly is a classic 
example of an inverse problem: given a set 
of reads randomly sampled from a target 
sequence, reconstruct the order and the po- 
sition of those reads in the target. Genome 
assembly algorithms developed for Dro- 
sophila have now been extended to assemble 
the ~25-fold larger human genome. Celera as- 
semblies consist of a set of contigs that are 
ordered and oriented into scaffolds that are then 
mapped to chromosomal locations by using 
known markers. The contigs consist of a col- 
lection of overlapping sequence reads that pro- 
vide a consensus reconstruction for a contigu- 
ous interval of the genome. Mate pairs are a 
central component of the assembly strategy. 
They are used to produce scaffolds in which the 
size of gaps between consecutive contigs is 
known with reasonable precision. This is ac- 
complished by observing that a pair of reads, 
one of which is in one contig, and the other of 
which is in another, implies an orientation and 
distance between the two contigs (Fig. 3). Fi- 
nally, our assemblies did not incorporate all 
reads into the final set of reported scaffolds. 
This set of unincorporated reads is termed 
"chaff," and typically consisted of reads from 
within highly repetitive regions, data from other 
organisms introduced through various routes as 
found in many genome projects, and data of 
poor quality or with untrimmed vector. 
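
The way a mate pair constrains the gap between two contigs can be illustrated with a small sketch. This is not Celera's assembler: the function names, the coordinate convention (read 1 on the forward strand of the left contig, read 2 on the reverse strand of the right contig), and the example numbers are assumptions made purely for illustration. Only the 10,800-bp mean insert size is taken from Table 1.

```python
from statistics import mean


def gap_estimate(insert_mean, left_contig_len, read1_start, read2_end):
    """Estimate the gap between two contigs bridged by one mate pair.

    Hypothetical convention (an assumption, not from the paper):
      - read 1 starts at 0-based offset read1_start in the left contig;
      - read 2 ends at offset read2_end (exclusive) in the right contig.
    The clone insert then covers the tail of the left contig, the gap,
    and the head of the right contig:
        insert = (left_contig_len - read1_start) + gap + read2_end
    A negative result would suggest that the contigs overlap.
    """
    return insert_mean - (left_contig_len - read1_start) - read2_end


def scaffold_gap(insert_mean, left_contig_len, pairs):
    """Average the estimates from several mate pairs linking the same contigs."""
    return mean(
        gap_estimate(insert_mean, left_contig_len, start, end)
        for start, end in pairs
    )


# Two illustrative mate pairs from a library with a 10,800-bp mean insert.
pairs = [(15_000, 4_800), (16_000, 5_700)]
print(scaffold_gap(10_800, 20_000, pairs))
```

Averaging over several pairs is one simple way to tighten the estimate, since each individual insert length varies around the library mean by the standard deviation reported in Table 1.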
2.1 Assembly data sets 

We used two independent sets of data for our 
assemblies. The first was a random shotgun 
data set of 27.27 million reads of average length 
543 bp produced at Celera. This consisted 
largely of mate-pair reads from 16 libraries 
constructed from DNA samples taken from five 
different donors. Libraries with insert sizes of 2, 
10, and 50 kbp were used. By looking at how 
mate pairs from a library were positioned in 
known sequenced stretches of the genome, we 
were able to characterize the range of insert 
sizes in each library and determine a mean and 
standard deviation. Table 1 details the number 
of reads, sequencing coverage, and clone cov- 
erage achieved by the data set. The clone cov- 
erage is the coverage of the genome in cloned 
DNA, considering the entire insert of each 
clone that has sequence from both ends. The 
clone coverage provides a measure of the 
amount of physical DNA coverage of the ge- 
nome. Assuming a genome size of 2.9 Gbp, the 
Celera trimmed sequences gave a 5.1X cover- 
age of the genome, and clone coverage was 
3.42X, 16.40X, and 18.84X for the 2-, 10-, and 
50-kbp libraries, respectively, for a total of 
38.7X clone coverage. 
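
The coverage figures above follow from simple arithmetic, sketched below. The sequence-coverage formula (reads times mean read length divided by genome size) is standard. The clone-coverage formula, which counts one clone per mated pair of reads and uses the "% Mates" and mean insert values from Table 1, is an assumption about how such a figure could be derived; it comes close to the reported values but does not reproduce them exactly.

```python
GENOME_SIZE_BP = 2.9e9  # genome size assumed in the text


def sequence_coverage(num_reads, mean_read_len_bp, genome_size_bp=GENOME_SIZE_BP):
    """Fold sequence coverage: total sequenced bases over genome size."""
    return num_reads * mean_read_len_bp / genome_size_bp


def clone_coverage(num_reads, mated_fraction, mean_insert_bp,
                   genome_size_bp=GENOME_SIZE_BP):
    """Approximate fold clone coverage (assumed formula, see lead-in).

    Two mated reads define one clone, and the whole insert of that clone
    counts toward physical coverage of the genome.
    """
    num_clones = num_reads * mated_fraction / 2
    return num_clones * mean_insert_bp / genome_size_bp


# 27,271,853 trimmed reads averaging 543 bp
print(round(sequence_coverage(27_271_853, 543), 2))

# 2-kbp libraries: 13,543,099 reads, 74.5% mated, 1,951-bp mean insert
print(round(clone_coverage(13_543_099, 0.745, 1_951), 2))
```

The first figure corresponds to the 5.1X sequence coverage quoted in the text; the second is an approximation to the 3.42X clone coverage listed for the 2-kbp libraries.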

The second data set was from the publicly 
funded Human Genome Project (PFP) and is 
primarily derived from BAC clones (30). The 
BAC data input to the assemblies came from a 
download of GenBank on 1 September 2000 
(Table 2) totaling 4443.3 Mbp of sequence. 
The data for each BAC is deposited at one of 
four levels of completion. Phase 0 data are a set 
of generally unassembled sequencing reads 
from a very light shotgun of the BAC, typically 
less than 1X. Phase 1 data are unordered as- 
semblies of contigs, which we call BAC contigs 
or bactigs. Phase 2 data are ordered assemblies 
of bactigs. Phase 3 data are complete BAC 
sequences. In the past 2 years the PFP has 
focused on a product of lower quality and com- 
pleteness, but on a faster time-course, by con- 
centrating on the production of Phase 1 data 
from a 3X to 4X light-shotgun of each BAC 
clone. 

We screened the bactig sequences for
contaminants by using the BLAST algorithm
against three data sets: (i) vector sequences
in Univec core (38), filtered for a 25-bp
match at 98% sequence identity at the ends
of the sequence and a 30-bp match internal
to the sequence; (ii) the nonhuman portion
of the High Throughput Genomic (HTG)
Sequences division of GenBank (39),
filtered at 200 bp at 98%; and (iii) the
nonredundant nucleotide sequences from
GenBank without primate and human virus
entries, filtered at 200 bp at 98%. Whenever
25 bp or more of vector was found within
50 bp of the end of a contig, the tip up to
the matching vector was excised. Under
these criteria we removed 2.6 Mbp of
possible contaminant and vector from the
Phase 3 data, 61.0 Mbp from the Phase 1
and 2 data, and 16.1 Mbp from the Phase 0
data (Table 2). This left us with a total of
4363.7 Mbp of PFP sequence data: 20%
finished, 75% rough-draft (Phase 1 and 2),
and 5% single sequencing reads (Phase 0).
An additional 104,018 BAC end-sequence
mate pairs were also downloaded and
included in the data sets for both assembly
processes (18).
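
The tip-excision rule above (25 bp or more of vector within 50 bp of a contig end) can be sketched as follows. This is a minimal illustration, not the screening code actually used: the function name, the half-open interval convention, and the choice to remove the vector match together with the tip are all assumptions.

```python
def trim_vector_tips(contig_len, vector_hits, min_hit=25, window=50):
    """Return the (start, end) half-open interval of the contig that is kept.

    vector_hits: list of (hit_start, hit_end) half-open intervals in contig
    coordinates, e.g. taken from BLAST matches against a vector database.
    """
    start, end = 0, contig_len
    for hit_start, hit_end in vector_hits:
        if hit_end - hit_start < min_hit:
            continue  # match too short to count as vector
        if hit_start <= window:
            # Vector near the left end: drop the tip and the match (assumption).
            start = max(start, hit_end)
        elif contig_len - hit_end <= window:
            # Vector near the right end: drop the match and the tip (assumption).
            end = min(end, hit_start)
    return start, max(start, end)
```

Matches that are internal to the contig are left alone in this sketch; the text handles those through the separate 30-bp internal-match filter.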

2.2 Assembly strategies 

Two different approaches to assembly were 
pursued. The first was a whole-genome as- 
sembly process that used Celera data and the 
PFP data in the form of additional synthetic 
shotgun data, and the second was a compart- 
mentalized assembly process that first parti- 
tioned the Celera and PFP data into sets 
localized to large chromosomal segments and 
then performed ab initio shotgun assembly on
each set. Figure 4 gives a schematic of the 
overall process flow. 

For the whole-genome assembly, the PFP
data was first disassembled or "shredded" into a
synthetic shotgun data set of 550-bp reads that
form a perfect 2X covering of the bactigs. This
resulted in 16.05 million faux reads that were
sufficient to cover the genome 2.96X because
of redundancy in the BAC data set, without
incorporating the biases inherent in the PFP
assembly process. The combined data set of
43.32 million reads (8X), and all associated
mate-pair information, were then subjected to
our whole-genome assembly algorithm to
produce a reconstruction of the genome. Neither
the location of a BAC in the genome nor its
assembly of bactigs was used in this process.
Bactigs were shredded into reads because we
found strong evidence that 2.13% of them were
misassembled (40). Furthermore, BAC location
information was ignored because some BACs
were not correctly placed on the PFP physical
map and because we found strong evidence that
at least 11% of the BACs contained sequence
data that were not part of the given BAC (41),
possibly as a result of sample-tracking errors

Table 2. GenBank data input into assembly.



Completion phase sequence 



Center 



Statistics 



1 and 2 



Whitehead Institute/ 
MIT Center for 
Genome Research, 
USA 



Washington University, 
USA 



Baylor College of 
Medicine, USA 



Production Sequencing 
Facility, DOE Joint 
Genome Institute,
USA 



The Institute of Physical 
and Chemical 
Research (RIKEN),
Japan 



Sanger Centre, UK 



Others* 



2,825 
243,785 



Number of accession records 
Number of contigs 
Total base pairs 
Total vector masked (bp) 
Total contaminant masked 
(bp) 

Average contig length (bp)
Number of accession records 
Number of contigs 
Total base pairs 
Total vector masked (bp) 
Total contaminant masked 
(bp) 

Average contig length (bp) 
Number of accession records 
Number of contigs 
Total base pairs 
Total vector masked (bp) 
Total contaminant masked 
(bp) 

Average contig length (bp) 
Number of accession records 
Number of contigs 
Total base pairs
Total vector masked (bp) 
Total contaminant masked 

Average contig length (bp) 
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Number of contigs 
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Total contaminant masked (bp) 
Average contig length (bp) 
Number of accession records 
Number of contigs 
Total base pairs 
Total vector masked (bp) 
Total contaminant masked (bp)
Average contig length (bp) 
Number of accession records 
Number of contigs 
Total base pairs 
Total vector masked (bp) 
Total contaminant masked 
(bp) 

Average contig length (bp) 
Number of accession records 
Number of contigs 
Total base pairs 
Total vector masked (bp) 
Total contaminant masked 
(bp) 

Average contig length (bp)
[...] Research; The Institute of Physical and Chemical Research [...]
Southwestern Medical Center, University of Washington. †The 4,^5.700.8^5
contributed by all centers were shredded into faux reads, resulting in
2.96X coverage of the genome.
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13,654,482 
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20.093.926 
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2.371 
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27,781 


7,093 


66,978 


4.538 


2.599 
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689.059,692 


246,118,000 


427.326 


25,054 


2,066.305 


374,561 


9.271 


94.697 


1.894 


3.458 


29.898 


3.458 


283.358.877 


246.474.157 


279.477 


32.136 


1,616,665 


1,791,849 


9,478 


71.277 


21.015 


9.137 
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9.137 


3.360.047,574 


835.722.268 


2.438.575 


82.284 


16.311,664 


3.365.230 


8,203

(see below). In short, we performed a true, ab
initio whole-genome assembly in which we
took the expedient of deriving additional
sequence coverage, but not mate pairs, assembled
bactigs, or genome locality, from some
externally generated data.
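
The shredding of bactigs into 550-bp faux reads with a 2X covering, described for the whole-genome assembly, can be sketched as below. The function name and the handling of sequence ends are assumptions; the text does not say how the ends of a bactig were treated.

```python
def shred(sequence, read_len=550, coverage=2):
    """Cut `sequence` into overlapping faux reads that tile it.

    Reads start every read_len // coverage bases, so interior bases are
    covered `coverage` times; the two ends are covered less deeply here.
    """
    if len(sequence) <= read_len:
        return [sequence]
    step = read_len // coverage  # 275 bp for 550-bp reads at 2X
    last_start = len(sequence) - read_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # final read flush with the sequence end
    return [sequence[s:s + read_len] for s in starts]
```

Because the reads are cut at fixed offsets, no information about how the bactig was originally assembled beyond its base sequence is carried into the faux reads, which is the property the text relies on.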

In the compartmentalized shotgun assembly
(CSA), Celera and PFP data were partitioned
into the largest possible chromosomal segments
or "components" that could be determined with
confidence, and then shotgun assembly was
applied to each partitioned subset wherein the
bactig data were again shredded into faux reads
to ensure an independent ab initio assembly of
the component. By subsetting the data in this
way, the overall computational effort was
reduced and the effect of interchromosomal
duplications was ameliorated. This also resulted in a
reconstruction of the genome that was relatively
independent of the whole-genome assembly
results so that the two assemblies could be
compared for consistency. The quality of the
partitioning into components was crucial so that
different genome regions were not mixed
together. We constructed components from (i) the
longest scaffolds of the sequence from each
BAC and (ii) assembled scaffolds of data unique
to Celera's data set. The BAC assemblies were
obtained by a combining assembler that used the
bactigs and the 5X Celera data mapped to those
bactigs as input. This effort was undertaken as
an interim step solely because the more accurate
and complete the scaffold for a given sequence
stretch, the more accurately one can tile these
scaffolds into contiguous components on the
basis of sequence overlap and mate-pair
information. We further visually inspected and
curated the scaffold tiling of the components to
further increase its accuracy. For the final CSA
assembly, all but the partitioning was ignored,
and an independent, ab initio reconstruction of
the sequence in each component was obtained
by applying our whole-genome assembly
algorithm to the partitioned, relevant Celera data and
the shredded, faux reads of the partitioned,
relevant bactig data.

2.3 Whole-genome assembly 
The algorithms used for whole-genome
assembly (WGA) of the human genome were
enhancements to those used to produce the
sequence of the Drosophila genome reported
in detail in (28).

The WGA assembler consists of a pipeline
composed of five principal stages: Screener,
Overlapper, Unitigger, Scaffolder, and Repeat
Resolver, respectively. The Screener finds
and marks all microsatellite repeats with less
than a 6-bp element, and screens out all
known interspersed repeat elements,
including Alu, Line, and ribosomal DNA. Marked
regions get searched for overlaps, whereas
screened regions do not get searched, but can
be part of an overlap that involves unscreened
matching segments.

The Overlapper compares every read
against every other read in search of complete 
end-to-end overlaps of at least 40 bp and with 
no more than 6% differences in the match.
Because all data are scrupulously vector- 
trimmed, the Overlapper can insist on com- 
plete overlap matches. Computing the set of 
all overlaps took roughly 10,000 CPU hours 
with a suite of four-processor Alpha SMPs 
with 4 gigabytes of RAM. This took 4 to 5
days in elapsed time with 40 such machines
operating in parallel.
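
The Overlapper's acceptance criterion (a complete end-to-end overlap of at least 40 bp with no more than 6% differences) can be illustrated with the naive sketch below. It is not the production algorithm: it counts substitutions only, ignores insertions and deletions, checks a single orientation, and runs in quadratic time per read pair. The function name is hypothetical.

```python
def best_end_overlap(a, b, min_len=40, max_diff=0.06):
    """Length of the longest suffix of `a` that matches a prefix of `b`
    with at most max_diff mismatches per aligned base, or 0 if none."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_diff * length:
            return length
    return 0
```

An all-against-all comparison of tens of millions of reads cannot be done this way in practice, which is consistent with the roughly 10,000 CPU hours the text reports even for the real implementation.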

Every overlap computed above is
statistically a 1-in-10^17 event and thus not a
coincidental event. What makes assembly
combinatorially difficult is that while many
overlaps are actually sampled from overlapping
regions of the genome, and thus imply that 
the sequence reads should be assembled to- 
gether, even more overlaps are actually from 
two distinct copies of a low-copy repeated 
element not screened above, thus constituting 
an error if put together. We call the former 
"true overlaps" and the latter "repeat-induced 
overlaps." The assembler must avoid choos- 
ing repeat-induced overlaps, especially early 
in the process. 

We achieve this objective in the
Unitigger. We first find all assemblies of reads that
appear to be uncontested with respect to all 
other reads. We call the contigs formed from 
these subassemblies unitigs (for uniquely as- 
sembled contigs). Formally, these unitigs are 
the uncontested interval subgraphs of the 
graph of all overlaps (42). Unfortunately, al- 
though empirically many of these assemblies 
are correct (and thus involve only true over- 
laps), some are in fact collections of reads 
from several copies of a repetitive element 
that have been overcollapsed into a single 
subassembly. However, the overcollapsed 
unitigs are easily identified because their av- 
erage coverage depth is too high to be con- 
sistent with the overall level of sequence 
coverage. We developed a simple statistical 
discriminator that gives the logarithm of the 
odds ratio that a unitig is composed of unique 
DNA or of a repeat consisting of two or more 
copies. The discriminator, set to a sufficiently 
stringent threshold, identifies a subset of the 
unitigs that we are certain are correct. In 
addition, a second, less stringent threshold
identifies a subset of remaining unitigs very
likely to be correctly assembled, of which we
select those that will consistently scaffold
(see below), and thus are again almost certain
to be correct. We call the union of these two
sets U-unitigs. Empirically, we found from a
6X simulated shotgun of human chromosome
22 that we get U-unitigs covering 98% of the
stretches of unique DNA that are >2 kbp
long. We are further able to identify the
boundary of the start of a repetitive element
at the ends of a U-unitig and leverage this so
that U-unitigs span more than 93% of all
singly interspersed Alu elements and other
100- to 400-bp repetitive segments.
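
A log-odds discriminator of the kind described above can be sketched under a simple Poisson model of read arrivals. The formula below is an assumption made for illustration; the text does not state the exact statistic or the thresholds used, and the function name is hypothetical.

```python
import math


def unique_vs_repeat_log_odds(num_reads, unitig_span_bp, total_reads, genome_size_bp):
    """Natural-log odds that a unitig is unique DNA rather than a collapsed
    two-copy repeat, assuming reads start at random (Poisson) positions.

    With lam = expected read count for a unique region of this span,
    ln[P(k | lam) / P(k | 2 * lam)] simplifies to lam - k * ln(2).
    """
    expected = unitig_span_bp * total_reads / genome_size_bp
    return expected - num_reads * math.log(2)
```

A unitig holding about the expected number of reads scores positive, while one holding about twice that number scores negative, which matches the observation that overcollapsed unitigs stand out by their excess coverage depth.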

The result of running the Unitigger was 
thus a set of correctly assembled subcontigs 
covering an estimated 73.6% of the human 
genome. The Scaffolder then proceeded to 
use mate-pair information to link these
together into scaffolds. When there are two or
more mate pairs that imply that a given pair
of U-unitigs are at a certain distance and
orientation with respect to each other, the
probability of this being wrong is again
roughly 1 in 10^10, assuming that mate pairs
are false less than 2% of the time. Thus, one
can with high confidence link together all
U-unitigs that are linked by at least two 2- or
10-kbp mate pairs producing intermediate-
sized scaffolds that are then recursively
linked together by confirming 50-kbp mate
pairs and BAC end sequences. This process
yielded scaffolds that are on the order of 
megabase pairs in size with gaps between 
their contigs that generally correspond to re- 
petitive elements and occasionally to small 
sequencing gaps. These scaffolds reconstruct 
the majority of the unique sequence within a 
genome. 
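
The rule that two or more agreeing mate pairs are needed before a pair of U-unitigs is linked can be sketched as below. The tuple layout, the orientation labels, and the 1000-bp agreement tolerance are hypothetical choices made for the example, not values taken from the text.

```python
from collections import defaultdict
from statistics import median


def confirmed_links(mate_links, min_support=2, tolerance_bp=1000):
    """Keep unitig pairs supported by at least `min_support` mate pairs that
    agree on orientation and, within a tolerance, on the implied distance.

    mate_links: iterable of (unitig_a, unitig_b, orientation, implied_gap_bp),
    with each unitig pair already written in a canonical order.
    Returns {(unitig_a, unitig_b, orientation): (support, mean_gap_bp)}.
    """
    groups = defaultdict(list)
    for unitig_a, unitig_b, orientation, gap_bp in mate_links:
        groups[(unitig_a, unitig_b, orientation)].append(gap_bp)

    links = {}
    for key, gaps in groups.items():
        centre = median(gaps)
        agreeing = [g for g in gaps if abs(g - centre) <= tolerance_bp]
        if len(agreeing) >= min_support:
            links[key] = (len(agreeing), sum(agreeing) / len(agreeing))
    return links
```

A pair connected by a single mate pair is left unlinked in this sketch, reflecting the text's point that confidence comes from independent mate pairs agreeing with one another.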

For the Drosophila assembly, we engaged
in a three-stage repeat resolution strategy
where each stage was progressively more
aggressive and thus more likely to make a
mistake. For the human assembly, we
continued to use the first "Rocks" substage where
all unitigs with a good, but not definitive,
discriminator score are placed in a scaffold
gap. This was done with the condition that
two or more mate pairs with one of their
reads already in the scaffold unambiguously
place the unitig in the given gap. We estimate
the probability of inserting a unitig into an
incorrect gap with this strategy to be less than
10^-7 based on a probabilistic analysis.

We revised the ensuing "Stones" substage 
of the human assembly, making it more like 
the mechanism suggested in our earlier work 
(43). For each gap, every read R that is placed 
in the gap by virtue of its mated pair M being 
in a contig of the scaffold and implying R's 
placement is collected. Celera's mate-pairing 
information is correct more than 99% of the 
time. Thus, almost every, but not all, of the 
reads in the set belong in the gap, and when 
a read does not belong it rarely agrees with 
the remainder of the reads. Therefore, we 
simply assemble this set of reads within the
gap, eliminating any reads that conflict with
the assembly. This operation proved much
more reliable than the one it replaced for the
Drosophila assembly; in the assembly of a
simulated shotgun data set of human
chromosome 22, all stones were placed correctly.

The final method of resolving gaps is to 
fill them with assembled BAC data that cover
the gap. We call this external gap "walking."
We did not include the very aggressive
"Pebbles" substage described in our Drosophila
work, which made enough mistakes so as to
produce repeat reconstructions for long
interspersed elements whose quality was only
99.62% correct. We decided that for the hu- 
man genome it was philosophically better not 
to introduce a step that was certain to produce 
less than 99.99% accuracy. The cost was a 
somewhat larger number of gaps of some- 
what larger size. 

At the final stage of the assembly process,
and also at several intermediate points, a
consensus sequence of every contig is
produced. Our algorithm is driven by the
principle of maximum parsimony, with
quality-value-weighted measures for evaluating each
base. The net effect is a Bayesian estimate of 
the correct base to report at each position. 
Consensus generation uses Celera data when- 
ever it is present. In the event that no Celera 
data cover a given region, the BAC data 
sequence is used. 
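
The idea of weighting each read's base call by its quality value when choosing a consensus base can be sketched as a simple weighted vote. This is a deliberate simplification and an assumption: it is not the Bayesian estimator described above, it does not implement the preference for Celera data over BAC data, and the function names are hypothetical.

```python
from collections import defaultdict


def consensus_base(column):
    """Pick the base with the largest summed quality in one alignment column.

    column: non-empty list of (base, quality_value) pairs, one per read
    covering this position.
    """
    weight = defaultdict(float)
    for base, quality in column:
        weight[base] += quality
    return max(weight, key=weight.get)


def consensus(columns):
    """Consensus string for a list of alignment columns."""
    return "".join(consensus_base(column) for column in columns)
```

In this sketch a single high-quality call can outvote one low-quality call but not several moderately confident calls that agree with each other.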

A key element of achieving a WGA of the
human genome was to parallelize the
Overlapper and the central consensus
sequence-constructing subroutines. In addition, memory was
a real issue: a straightforward application of
the software we had built for Drosophila would
have required a computer with a 600-gigabyte
RAM. By making the Overlapper and Unitigger
incremental, we were able to achieve the same
computation with a maximum of instantaneous
usage of 28 gigabytes of RAM. Moreover, the
incremental nature of the first three stages
allowed us to continually update the state of this
part of the computation as data were delivered
and then perform a 7-day run to complete
Scaffolding and Repeat Resolution whenever
desired. For our assembly operations, the total
compute infrastructure consists of 10
four-processor SMPs with 4 gigabytes of memory per
cluster (Compaq's ES40, Regatta) and a
16-processor NUMA machine with 64 gigabytes
of memory (Compaq's GS160, Wildfire). The
total compute for a run of the assembler was
roughly 20,000 CPU hours.

The assembly of Celera's data, together
with the shredded bactig data, produced a set of
scaffolds totaling 2.848 Gbp in span and
consisting of 2.586 Gbp of sequence. The chaff, or
set of reads not incorporated in the assembly,
numbered 11.27 million (26%), which is
consistent with our experience for Drosophila.
More than 84% of the genome was covered by
scaffolds >100 kbp long, and these averaged
91% sequence and 9% gaps with a total of
2.297 Gbp of sequence. There were a total of
93,857 gaps among the 1637 scaffolds >100
kbp. The average scaffold size was 1.5 Mbp,
the average contig size was 24.06 kbp, and the
average gap size was 2.43 kbp, where the
distribution of each was essentially exponential.
More than 50% of all gaps were less than 500
bp long, >62% of all gaps were less than 1 kbp
long, and no gap was >100 kbp long.
Similarly, more than 65% of the sequence is in contigs
>30 kbp, more than 31% is in contigs >100
kbp, and the largest contig was 1.22 Mbp long.
Table 3 gives detailed summary statistics for
the structure of this assembly with a direct
comparison to the compartmentalized shotgun
assembly.

2.4 Compartmentalized shotgun 
assembly 

In addition to the WGA approach, we
pursued a localized assembly approach that was
intended to subdivide the genome into seg- 
ments, each of which could be shotgun as- 
sembled individually. We expected that this 
would help in resolution of large interchro- 
mosomal duplications and improve the statis- 
tics for calculating U-unitigs. The compart- 
mentalized assembly process involved clus- 
tering Celera reads and bactigs into large, 
multiple megabase regions of the genome, 
and then running the WGA assembler on the
Celera data and shredded, faux reads ob- 
tained from the bactig data. 

The first phase of the CSA strategy was to
separate Celera reads into those that matched 
the BAC contigs for a particular PFP BAC 
entry, and those that did not match any public 
data. Such matches must be guaranteed to 



Table 3. Scaffold statistics for whole-genome and compartmentalized shotgun assemblies.

Scaffold size                                   All        >30 kbp       >100 kbp       >500 kbp      >1000 kbp

Compartmentalized shotgun assembly
No. of bp in scaffolds                2,905,568,203  2,748,892,430  2,700,489,906  2,489,357,260  2,248,689,128
  (including intrascaffold gaps)
No. of bp in contigs                  2,653,979,733  2,524,251,302  2,491,538,372  2,320,648,201  2,106,521,902
No. of scaffolds                             53,591          2,845          1,935          1,060            721
No. of contigs                              170,033        112,207        107,199         93,138         82,009
No. of gaps                                 116,442        109,362        105,264         92,078         81,288
No. of gaps ≤1 kbp                           72,091         69,175         67,289         59,915         53,354
Average scaffold size (bp)                   54,217        966,219      1,395,602      2,348,450      3,118,848
Average contig size (bp)                     15,609         22,496         23,242         24,916         25,686
Average intrascaffold gap size (bp)           2,161          2,054          1,985          1,832          1,749
Largest contig (bp)                       1,988,321      1,988,321      1,988,321      1,988,321      1,988,321
% of total contigs                              100             95             94             87             79

Whole-genome assembly
No. of bp in scaffolds                2,847,890,390
23,534
24,061

properly place a Celera read, so all reads were
first masked against a library of common
repetitive elements, and only matches of at
least 40 bp to unmasked portions of the read
constituted a hit. Of Celera's 27.27 million
reads, 20.76 million matched a bactig and
another 0.62 million reads, which did not
have any matches, were nonetheless
identified as belonging in the region of the bactig's
BAC because their mate matched the bactig.
Of the remaining reads, 2.92 million were
completely screened out and so could not be
matched, but the other 2.97 million reads had
unmasked sequence totaling 1.189 Gbp that
were not found in the GenBank data set.
Because the Celera data are 5.11X redundant,
we estimate that 240 Mbp of unique Celera
sequence is not in the GenBank data set.

In the next step of the CSA process, a 
combining assembler took the relevant 5X 
Celera reads and bactigs for a BAC entry, and
produced an assembly of the combined data 
for that locale. These high-quality sequence 
reconstructions were a transient result whose 
utility was simply to provide more reliable 
information for the purposes of their tiling 
into sets of overlapping and adjacent scaffold 
sequences in the next step. In outline, the 
combining assembler first examines the set of 
matching Celera reads to determine if there
are excessive pileups indicative of
unscreened repetitive elements. Wherever these
occur, reads in the repeat region whose mates
have not been mapped to consistent positions
are removed. Then all sets of mate pairs that
consistently imply the same relative position
of two bactigs are bundled into a link and
weighted according to the number of mates in 
the bundle. A "greedy" strategy then attempts 
to order the bactigs by selecting bundles of 
mate-pairs in order of their weight. A selected 
mate-pair bundle can tie together two forma- 
tive scaffolds. It is incorporated to form a 
single scaffold only if it is consistent with the 
majority of links between contigs of the scaf- 
fold. Once scaffolding is complete, gaps are 
filled by the "Stones" strategy described 
above for the WGA assembler. 
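
The greedy selection of mate-pair bundles in order of weight can be sketched with a union-find structure, as below. This tracks only which bactigs end up in the same scaffold, not their order or orientation, and the consistency check is left as a caller-supplied predicate because the text does not spell it out. All names are hypothetical.

```python
def greedy_scaffold(bactigs, bundles,
                    is_consistent=lambda bundle, members_a, members_b: True):
    """Merge bactigs into scaffolds by taking bundles heaviest first.

    bundles: list of (weight, bactig_a, bactig_b), where weight is the
    number of mate pairs in the bundle.
    is_consistent: stand-in for the test that a bundle agrees with the
    majority of links already inside the two scaffolds it would join.
    Returns a list of sets of bactig names, one set per scaffold.
    """
    parent = {b: b for b in bactigs}
    members = {b: {b} for b in bactigs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for bundle in sorted(bundles, key=lambda t: t[0], reverse=True):
        _, a, b = bundle
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # both bactigs already sit in the same scaffold
        if not is_consistent(bundle, members[root_a], members[root_b]):
            continue  # reject bundles that contradict existing links
        parent[root_b] = root_a
        members[root_a] |= members.pop(root_b)

    return sorted(members.values(), key=lambda s: sorted(s))
```

Taking the heaviest bundles first means the best-supported joins are committed before weaker, possibly repeat-induced, links are considered.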

The GenBank data for the Phase 1 and 2
BACs consisted of an average of 19.8 bactigs
per BAC of average size 8099 bp.
Application of the combining assembler resulted in
individual Celera BAC assemblies being put
together into an average of 1.83 scaffolds
(median of 1 scaffold) consisting of an
average of 8.57 contigs of average size 18,973 bp.
In addition to defining order and orientation
of the sequence fragments, there were 57%
fewer gaps in the combined result. For Phase
0 data, the average GenBank entry consisted
of 91.52 reads of average length 784 bp.
Application of the combining assembler
resulted in an average of 54.8 scaffolds
consisting of an average of 58.1 contigs of average
size 873 bp. Basically, some small amount of
assembly took place, but not enough Celera
data were matched to truly assemble the 0.5X
to 1X data set represented by the typical
Phase 0 BACs. The combining assembler
was also applied to the Phase 3 BACs for
SNP identification, confirmation of
assembly, and localization of the Celera reads. The
Phase 0 data suggest that a combined
whole-genome shotgun data set and 1X light-shotgun
of BACs will not yield good assembly of
BAC regions; at least 3X light-shotgun of
each BAC is needed.

The 5.89 million Celera fragments not
matching the GenBank data were assembled 
with our whole-genome assembler. The as- 
sembly resulted in a set of scaffolds totaling 
442 Mbp in span and consisting of 326 Mbp 
of sequence. More than 20% of the scaffolds 
were >5 kbp long, and these averaged 63% 
sequence and 27% gaps with a total of 302 
Mbp of sequence. All scaffolds >5 kbp were 
forwarded along with all scaffolds produced 
by the combining assembler to the subse- 
quent tiling phase. 

At this stage, we typically had one or two 
scaffolds for every BAC region constituting
at least 95% of the relevant sequence, and a 
collection of disjoint Celera-unique scaffolds. 
The next step in developing the genome
components was to determine the order and
overlap tiling of these BAC and Celera-unique
scaffolds across the genome. For this, we 
used Celera's 50-kbp mate-pairs information, 
and BAC-end pairs (18) and sequence tagged
site (STS) markers (44) to provide long-range
guidance and chromosome separation.
Given the relatively manageable number of 
scaffolds, we chose not to produce this tiling 
in a fully automated manner, but to compute
an initial tiling with a good heuristic and then 
use human curators to resolve discrepancies 
or missed join opportunities. To this end, we 
developed a graphical user interface that dis- 
played the graph of tiling overlaps and the 
evidence for each. A human curator could 
then explore the implication of mapped STS 
data, dot-plots of sequence overlap, and a 
visual display of the mate-pair evidence sup- 
porting a given choice. The result of this 
process was a collection of "components," 
where each component was a tiled set of 
BAC and Celera-unique scaffolds that had
been curator-approved. The process resulted 
in 3845 components with an estimated span 
of 2.922 Gbp. 

In order to generate the final CSA, we
assembled each component with the WGA 
algorithm. As was done in the WGA process, 
the bactig data were shredded into a synthetic 
2X shotgun data set in order to give the 
assembler the freedom to independently
assemble the data. By using faux reads rather
than bactigs, the assembly algorithm could 
correct errors in the assembly of bactigs and 
remove chimeric content in a PFP data entry. 



Chimeric or contaminating sequence (from
another part of the genome) would not be 
incorporated into the reassembly of the com- 
ponent because it did not belong there. In 
effect, the previous steps in the CSA process 
served only to bring together Celera frag- 
ments and PFP data relevant to a large con- 
tiguous segment of the genome, wherein we 
applied the assembler used for WGA to pro- 
duce an ab initio assembly of the region. 

WGA assembly of the components
resulted in a set of scaffolds totaling 2.906 Gbp in
span and consisting of 2.654 Gbp of
sequence. The chaff, or set of reads not
incorporated into the assembly, numbered 6.17
million, or 22%. More than 90.0% of the
genome was covered by scaffolds spanning
>100 kbp long, and these averaged 92.2%
sequence and 7.8% gaps with a total of 2.492
Gbp of sequence. There were a total of
105,264 gaps among the 107,199 contigs that
belong to the 1940 scaffolds spanning >100
kbp. The average scaffold size was 1.4 Mbp,
the average contig size was 23.24 kbp, and
the average gap size was 2.0 kbp where each
distribution of sizes was exponential. As
such, averages tend to be underrepresentative
of the majority of the data. Figure 5 shows a
histogram of the bases in scaffolds of various
size ranges. Consider also that more than
49% of all gaps were <500 bp long, more
than 62% of all gaps were <1 kbp, and all
gaps are <100 kbp long. Similarly, more than
73% of the sequence is in contigs >30 kbp,
more than 49% is in contigs >100 kbp, and
the largest contig was 1.99 Mbp long. Table 3
provides summary statistics for the structure
of this assembly with a direct comparison to
the WGA assembly.

2.5 Comparison of the WGA and CSA 
scaffolds 

Having obtained two assemblies of the hu- 
man genome via independent computational 
processes (WGA and CSA), we compared
scaffolds from the two assemblies as another 
means of investigating their completeness, 
consistency, and contiguity. From each as- 
sembly, a set of reference scaffolds contain- 
ing at least 1000 fragments (Celera sequenc- 
ing reads or bactig shreds) was obtained; this 
amounted to 2218 WGA scaffolds and 1717 
CSA scaffolds, for a total of 2.087 Gbp and 
2.474 Gbp. The sequence of each reference 
scaffold was compared to the sequence of all 
scaffolds from the other assembly with which 
it shared at least 20 fragments or at least 20% 
of the fragments of the smaller scaffold. For 
each such comparison, all matches of at least 
200 bp with at most 2% mismatch were 
tabulated. 
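
The rule for deciding which scaffold pairs to compare (at least 20 shared fragments, or at least 20% of the fragments of the smaller scaffold) can be written as a small predicate. Representing each scaffold as a set of fragment identifiers is an assumption made for the example, and the function name is hypothetical.

```python
def should_compare(ref_fragments, other_fragments, min_shared=20, min_fraction=0.20):
    """True if two scaffolds share enough fragments to warrant a
    sequence-level comparison.

    Each argument is a set of fragment identifiers (Celera reads or
    bactig shreds) placed in the scaffold.
    """
    shared = len(ref_fragments & other_fragments)
    smaller = min(len(ref_fragments), len(other_fragments))
    return shared >= min_shared or (smaller > 0 and shared >= min_fraction * smaller)
```

The fractional test lets a small scaffold qualify against a large reference scaffold even when it shares fewer than 20 fragments in absolute terms.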

From this tabulation, we estimated the
amount of unique sequence in each assembly
in two ways. The first was to determine the
number of bases of each assembly that were
not covered by a matching segment in the
other assembly. Some 82.5 Mbp of the WGA
(3.95%) was not covered by the CSA,
whereas 204.5 Mbp (8.26%) of the CSA was not
covered by the WGA. This estimate did not
require any consistency of the assemblies or
any uniqueness of the matching segments.
Thus, another analysis was conducted in
which matches of less than 1 kbp between a
pair of scaffolds were excluded unless they
were confirmed by other matches having a
consistent order and orientation. This gives
some measure of consistent coverage: 1.982
Gbp (95.00%) of the WGA is covered by the
CSA, and 2.169 Gbp (87.69%) of the CSA is
covered by the WGA by this more stringent
measure.

The comparison of WGA to CSA also
permitted evaluation of scaffolds for
structural inconsistencies. We looked for instances in
which a large section of a scaffold from one
assembly matched only one scaffold from the
other assembly, but failed to match over the
full length of the overlap implied by the
matching segments. An initial set of
candidates was identified automatically, and then
each candidate was inspected by hand. From
this process, we identified 31 instances in
which the assemblies appear to disagree in a
nonlocal fashion. These cases are being
further evaluated to determine which assembly
is in error and why.

In addition, we evaluated local inconsis- 
tencies of order or orientation. The following 
results exclude cases in which one contig in 
one assembly corresponds to more than one 
overlapping contig in the other assembly (as 
long as the order and orientation of the latter 
agrees with the positions they match in the 
former). Most of these small rearrangements 
involved segments on the order of hundreds 
of base pairs and rarely >1 kbp. We found a 
total of 295 kbp (0.012%) in the CSA assem- 
blies that were locally inconsistent with the 
WGA assemblies, whereas 2.108 Mbp 
(0.11%) in the WGA assembly were incon- 
sistent with the CSA assembly. 



The CSA assembly was a few percentage 
points better in terms of coverage and slightly
more consistent than the WGA, because it 
was in effect performing a few thousand shot- 
gun assemblies of megabase-sized problems, 
whereas the WGA is performing a shotgun 
assembly of a gigabase-sized problem. When 
one considers the increase of two-and-a-half 
orders of magnitude in problem size, the
information loss between the two is remarkably
small. Because CSA was logistically easier to 
deliver and the better of the two results avail- 
able at the time when downstream analyses 
needed to be begun, all subsequent analysis 
was performed on this assembly.

2.6 Mapping scaffolds to the genome 

The final step in assembling the genome was to 
order and orient the scaffolds on the chromo- 
somes. We first grouped scaffolds together on 
the basis of their order in the components from
CSA. These grouped scaffolds were reordered 
by examining residual mate-pairing data be- 
tween the scaffolds. We next mapped the scaf- 
fold groups onto the chromosome using physi- 
cal mapping data. This step depends on having 
reliable high-resolution map information such
that each scaffold will overlap multiple
markers. There are two genome-wide types of map
information available: high-density STS maps
and fingerprint maps of BAC clones developed 
at Washington University (45). Among the ge- 
nome-wide STS maps, GeneMap99 (GM99) 
has the most markers and therefore was most 
useful for mapping scaffolds. The two different 
mapping approaches are complementary to one 
another. The fingerprint maps should have
better local order because they were built by
comparison of overlapping BAC clones. On the
other hand, GM99 should have a more reliable 
long-range order, because the framework mark- 
ers were derived from well-validated genetic 
maps. Both types of maps were used as a 
reference for human curation of the compo- 
nents that were the input to the regional assem- 
bly, but they did not determine the order of 
sequences produced by the assembler. 



In order to determine the effectiveness of 
the fingerprint maps and GM99 for mapping 
scaffolds, we first examined the reliability of 
these maps by comparison with large scaf- 
folds. Only 1% of the STS markers on the 10 
largest scaffolds (those >9 Mbp) were 
mapped on a different chromosome on 
GM99. Two percent of the STS markers dis-
agreed in position by more than five frame-
work bins. However, for the fingerprint
maps, a 2% chromosome discrepancy was 
observed, and on average 23.8% of BAC 
locations in the scaffold sequence disagreed 
with fingerprint map placement by more than 
five BACs. When further examining the 
source of discrepancy, it was found that most 
of the discrepancy came from 4 of the 10
scaffolds, indicating that there is variation in
the quality of either the map or the scaffolds. 
All four scaffolds were assembled, as well as 
the other six, as judged by clone coverage 
analysis, and showed the same low discrep- 
ancy rate to GM99, and thus we concluded 
that the fingerprint map global order in these 
cases was not reliable. Smaller scaffolds had 
a higher discordance rate with GM99 (4.21% 
of STSs were discordant by more than five 
framework bins), but a lower discordance rate
with the fingerprint maps (11% of BACs
disagreed with fingerprint maps by more than
five BACs). This observation agrees with the 
clone coverage analysis (46) that Celera scaf- 
fold construction was better supported by 
long-range mate pairs in larger scaffolds than 
in small scaffolds. 

We created two orderings of Celera scaf- 
folds on the basis of the markers (BAC or 
STS) on these maps. Where the order of 
scaffolds agreed between GM99 and the 
WashU BAC map, we had a high degree of 
confidence that that order was correct; these 
scaffolds were termed "anchor scaffolds."
Only scaffolds with a low overall discrepancy 
rate with both maps were considered anchor 
scaffolds. Scaffolds in GM99 bins were al- 
lowed to permute in their order to match 
WashU ordering, provided they did not vio- 
late their framework orders. Orientation of 
individual scaffolds was determined by the 
presence of multiple mapped markers with 
consistent order. Scaffolds with only one 
marker have insufficient information to as- 
sign orientation. We found 70.1% of the ge- 
nome in anchored scaffolds, more than 99% 
of which are also oriented (Table 4). Because 
GM99 is of lower resolution than the WashU 
map, a number of scaffolds without STS 
matches could be ordered relative to the an- 
chored scaffolds because they included se- 
quence from the same or adjacent BACs on 
the WashU map. On the other hand, because 
of occasional WashU global ordering dis- 
crepancies, a number of scaffolds determined 
to be "unmappable" on the WashU map could
be ordered relative to the anchored scaffolds 



Fig. 5. Distribution of scaffold sizes of the CSA. For each range of scaffold sizes
(<30 kb, 30-50 kb, 50-100 kb, 100-500 kb, 0.5-1 Mb, 1-5 Mb, 5-10 Mb, >10 Mb), the
percent of total sequence is indicated.



1314 



16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org



with GM99. These scaffolds were termed
"ordered scaffolds." We found that 13.9% of
the assembly could be ordered by these ad-
ditional methods, and thus 84.0% of the ge-
nome was ordered unambiguously.

Next, all scaffolds that could be placed,
but not ordered, between anchors were as-
signed to the interval between the anchored
scaffolds and were deemed to be "bound-
ed" between them. For example, small scaf-
folds having STS hits from the same Gene-
Map bin or hitting the same BAC cannot be
ordered relative to each other, but can be 
assigned a placement boundary relative to 
other anchored or ordered scaffolds. The 
remaining scaffolds either had no localiza- 
tion information, conflicting information, 
or could only be assigned to a generic 
chromosome location. Using the above ap-
proaches, ~98% of the genome was an-
chored, ordered, or bounded.

Finally, we assigned a location for each 
scaffold placed on the chromosome by 
spreading out the scaffolds per chromosome. 
We assumed that the remaining unmapped 
scaffolds, constituting 2% of the genome, 
were distributed evenly across the genome.
By dividing the sum of unmapped scaffold
lengths by the sum of the number of
mapped scaffolds, we arrived at an estimate 
of interscaffold gap of 1483 bp. This gap was 
used to separate all the scaffolds on each
chromosome and to assign an offset in the
chromosome.

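The placement rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the assembly: the function names and the toy scaffold lengths are hypothetical, and the gap estimate simply follows the description in the text (total unmapped length divided by the number of mapped scaffolds).

```python
def estimate_gap(unmapped_lengths, n_mapped):
    """Average interscaffold gap: unmapped bases spread evenly over the mapped scaffolds."""
    return sum(unmapped_lengths) // n_mapped


def assign_offsets(mapped_lengths, gap):
    """Return the start offset of each scaffold (already in chromosome order),
    inserting a fixed gap between consecutive scaffolds."""
    offsets = []
    position = 0
    for length in mapped_lengths:
        offsets.append(position)
        position += length + gap
    return offsets


# Toy numbers for illustration only (hypothetical, not taken from the paper).
gap = estimate_gap(unmapped_lengths=[1200, 800, 1000], n_mapped=2)
offsets = assign_offsets([50_000, 30_000], gap)
```

With these toy inputs the second scaffold starts one scaffold length plus one gap after the first.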
During the scaffold-mapping effort, we en-
countered many problems that resulted in addi-
tional quality assessment and validation analy-
sis. At least 978 (3% of 33,173) BACs were 
believed to have sequence data from more than 
one location in the genome (47). This is con- 
sistent with the bactig chimerism analysis re- 
ported above in the Assembly Strategies sec- 
tion. These BACs could not be assigned to 
unique positions within the CSA assembly and 
thus could not be used for ordering scaffolds. 
Likewise, it was not always possible to assign 
STSs to unique locations in the assembly be- 
cause of genome duplications, repetitive ele- 
ments, and pseudogenes. 

Because of the time required for an ex- 
haustive search for a perfect overlap, CSA 
generated 21,607 intrascaffold gaps where 
the mate-pair data suggested that the contigs 
should overlap, but no overlap was found. 
These gaps were defined as a fixed 50 bp in
length and make up 18.6% of the total
116,442 gaps in the CSA assembly.

We chose not to use the order of exons 
implied in cDNA or EST data as a way of 
ordering scaffolds. The rationale for not us-
ing this data was that doing so would have
biased certain regions of the assembly by
rearranging scaffolds to fit the transcript data
and made validation of both the assembly and
gene definition processes more difficult.

2.7 Assembly and validation analysis

We analyzed the assembly of the genome 
from the perspectives of completeness 
(amount of coverage of the genome) and 
correctness (the structural accuracy of the 
order and orientation and the consensus se- 
quence of the assembly). 

Completeness. Completeness is defined as
the percentage of the euchromatic sequence
represented in the assembly. This cannot be
known with absolute certainty until the eu-
chromatin sequence has been completed.
However, it is possible to estimate complete- 
ness on the basis of (i) the estimated sizes of 
intrascaffold gaps; (ii) coverage of the two 
published chromosomes, 21 and 22 (48, 49); 
and (iii) analysis of the percentage of an 
independent set of random sequences (STS 
markers) contained in the assembly. The 
whole-genome libraries contain heterochro- 
matic sequence and, although no attempt has 
been made to assemble it, there may be in-
stances of unique sequence embedded in re-
gions of heterochromatin as were observed in
Drosophila (50, 51). 

The sequences of human chromosomes 21 
and 22 have been completed to high quality 
and published (48, 49). Although this se- 
quence served as input to the assembler, the 
finished sequence was shredded into a shot- 
gun data set so that the assembler had the 
opportunity to assemble it differently from 
the original sequence in the case of structural 
polymorphisms or assembly errors in the 
BAC data. In particular, the assembler must
be able to resolve repetitive elements at the 
scale of components (generally multimega- 
base in size), and so this comparison reveals 
the level to which the assembler resolves 
repeats. In certain areas, the assembly struc- 
ture differs from the published versions of 
chromosomes 21 and 22 (see below). The 
consequence of the flexibility to assemble 
"finished" sequence differently on the basis 
of Celera data resulted in an assembly with 
more segments than the chromosome 21 and 
22 sequences. We examined the reasons why 
there are more gaps in the Celera sequence 
than in chromosomes 21 and 22 and expect 
that they may be typical of gaps in other 
regions of the genome. In the Celera assem- 
bly, there are 25 scaffolds, each containing at 
least 10 kb of sequence, that collectively span 
94.3% of chromosome 21. Sixty-two scaf- 
folds span 95.7% of chromosome 22. The 
total length of the gaps remaining in the 
Celera assembly for these two chromosomes 
is 3.4 Mbp. These gap sequences were ana- 
lyzed by RepeatMasker and by searching 
against the entire genome assembly (52). 
About 50% of the gap sequence consisted of 
common repetitive elements identified by Re- 
peatMasker; more than half of the remainder 
was lower copy number repeat elements. 

A more global way of assessing complete-
ness is to measure the content of an independent
set of sequence data in the assembly. We com-
pared 48,938 STS markers from GeneMap99
(51) to the scaffolds. Because these markers
were not used in the assembly processes, they 
provided a truly independent measure of com- 
pleteness. ePCR (53) and BLAST (54) were 
used to locate STSs on the assembled genome. 
We found 44,524 (91%) of the STSs in the 
mapped genome. An additional 2648 markers 
(5.4%) were found by searching the unas-
sembled data or "chaff." We identified 1283
STS markers (2.6%) not found in either Celera 
sequence or BAC data as of September 2000, 
raising the possibility that these markers may 
not be of human origin. If that were the case, 
the Celera assembled sequence would represent 
93.4% of the human genome and the unas- 
sembled data 5.5%, for a total of 98.9% cover- 
age. Similarly, we compared CSA against 
36,678 TNG radiation hybrid markers (55a) 
using the same method. We found that 32,371
markers (88%) were located in the mapped
CSA scaffolds, with 2055 markers (5.6%)
found in the remainder. This gave a 94% cov-
erage of the genome through another genome-
wide survey.
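The marker-based completeness figures quoted above follow from simple ratios. The short Python sketch below recomputes them from the counts given in the text; the function name and its `excluded` parameter are hypothetical, introduced only for this illustration.

```python
def coverage_estimate(total, in_assembly, in_chaff, excluded=0):
    """Percent of markers found in the assembly and in unassembled data,
    after dropping `excluded` markers from the denominator."""
    denom = total - excluded
    return (round(100.0 * in_assembly / denom, 1),
            round(100.0 * in_chaff / denom, 1))


# GeneMap99 STS markers: all 48,938 markers in the denominator.
gm99_all = coverage_estimate(48_938, 44_524, 2_648)

# Same counts, but excluding the 1283 markers found in neither data set.
# The assembled share matches the 93.4% in the text; the chaff share
# rounds to 5.6% here, versus the 5.5% quoted.
gm99_adjusted = coverage_estimate(48_938, 44_524, 2_648, excluded=1_283)

# TNG radiation hybrid markers.
tng = coverage_estimate(36_678, 32_371, 2_055)
```

Summing each pair gives the overall coverage figure for that marker set.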

Correctness. Correctness is defined as the
structural and sequence accuracy of the as-
sembly. Because the source sequences for the
Celera data and the GenBank data are from
different individuals, we could not directly
compare the consensus sequence of the as-
sembly against other finished sequence for
determining sequencing accuracy at the nu-
cleotide level, although this has been done for
identifying polymorphisms as described in
Section 6. The accuracy of the consensus
sequence is at least 99.96% on the basis of a
statistical estimate derived from the quality
values of the underlying reads.

Table 4. Summary of scaffold mapping. Scaffolds
were mapped to the genome with different levels
of confidence (anchored scaffolds have the highest
confidence; unmapped scaffolds have the lowest).
Anchored scaffolds were consistently ordered by
the WashU BAC map and GM99. Ordered scaf-
folds were consistently ordered by at least one of
the following: the WashU BAC map, GM99, or
component tiling path. Bounded scaffolds had or-
der conflicts between at least two of the external
maps, but their placements were adjacent to a
neighboring anchored or ordered scaffold. Un-
mapped scaffolds had, at most, a chromosome
assignment. The scaffold subcategories are given
below each category.

Scaffold             Number   Length (bp)      % Total
category                                       length
Anchored              1,526   1,860,676,676    70
  Oriented            1,246   1,852,088,645    70
  Unoriented            280       8,588,031    0.3
Ordered               2,001     369,235,857    14
  Oriented              839     329,633,166    12
  Unoriented          1,162      39,602,691
Bounded              38,241     368,753,463    14
  Oriented            7,453     274,536,424    10
  Unoriented         30,788      94,217,039
Unmapped             11,823      55,313,737
                        281       2,505,844
                     11,542      52,807,893

The structural consistency of the assembly
can be measured by mate-pair analysis. In a
correct assembly, every mated pair of se-
quencing reads should be located on the con-
sensus sequence with the correct separation
and orientation between the pairs. A pair is
termed "valid" when the reads are in the
correct orientation, and the distance between
them is within the mean ± 3 standard devi-
ations of the distribution of insert sizes of the
library from which the pair was sampled. A
pair is termed "misoriented" when the reads
are not correctly oriented, and is termed "mis-
separated" when the distance between the
reads is not in the correct range but the reads
are correctly oriented. The mean ± the stan-
dard deviation of each library used by the
assembler was determined as described
above. To validate these, we examined all
reads mapped to the finished sequence of
chromosome 21 (48) and determined how
many incorrect mate pairs there were as a
result of laboratory tracking errors and chi-
merism (two different segments of the ge-
nome cloned into the same plasmid), and how
tight the distribution of insert sizes was for
those that were correct (Table 5). The stan-
dard deviations for all Celera libraries were
quite small, less than 15% of the insert
length, with the exception of a few 50-kbp
libraries. The 2- and 10-kbp libraries con-
tained less than 2% invalid mate pairs, where-
as the 50-kbp libraries were somewhat higher
(~10%). Thus, although the mate-pair infor-
mation was not perfect, its accuracy was such
that measuring valid, misoriented, and mis-
separated pairs with respect to a given assem-
bly was deemed to be a reliable instrument
for validation purposes, especially when sev-
eral mate pairs confirm or deny an ordering.
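The three mate-pair categories defined above translate directly into a small classifier. The following Python sketch is illustrative only; the function name and argument layout are hypothetical, and the example library parameters (mean 2,082 bp, SD 90 bp) are the genome-wide values listed for library 1 in Table 5.

```python
def classify_mate_pair(distance, correctly_oriented, mean, sd, n_sd=3):
    """Classify one mate pair against its library's insert-size distribution.

    A pair is "valid" when the orientation is correct and the distance lies
    within mean +/- n_sd * sd; "misoriented" when the orientation is wrong;
    "misseparated" when only the distance is out of range.
    """
    if not correctly_oriented:
        return "misoriented"
    if abs(distance - mean) <= n_sd * sd:
        return "valid"
    return "misseparated"


# Library 1 (2-kbp library): the valid window is 2,082 +/- 270 bp.
examples = [
    classify_mate_pair(2_200, True, mean=2_082, sd=90),
    classify_mate_pair(2_500, True, mean=2_082, sd=90),
    classify_mate_pair(2_100, False, mean=2_082, sd=90),
]
```

Orientation is checked first, so a pair with both problems is counted as misoriented, which matches the definitions in the text.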

The clone coverage of the genome was
39X, meaning that any given base pair was,
on average, contained in 39 clones or, equiv-
alently, spanned by 39 mate-paired reads.
Areas of low clone coverage or areas with a
high proportion of invalid mate pairs would
indicate potential assembly problems. We
computed the coverage of each base in the
assembly by valid mate pairs (Table 6). In
summary, for scaffolds >30 kbp in length,
less than 1% of the Celera assembly was in
regions of less than 3X clone coverage. Thus,
more than 99% of the assembly, including
order and orientation, is strongly supported
by this measure alone.
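One way to compute per-base clone coverage from a set of valid mate pairs is a difference array over the scaffold. The sketch below is illustrative only and makes no claim about how the original analysis was implemented: the function names and toy spans are hypothetical, and each valid pair is assumed to be supplied as a half-open (start, end) interval covering the whole clone.

```python
def clone_coverage(length, clone_spans):
    """Per-base clone coverage from half-open (start, end) spans of valid mate pairs."""
    delta = [0] * (length + 1)
    for start, end in clone_spans:
        delta[start] += 1   # coverage rises where a clone begins
        delta[end] -= 1     # and falls just past where it ends
    coverage = []
    depth = 0
    for change in delta[:length]:
        depth += change
        coverage.append(depth)
    return coverage


def fraction_below(coverage, threshold=3):
    """Fraction of bases whose clone coverage is below `threshold`."""
    return sum(1 for depth in coverage if depth < threshold) / len(coverage)


# Toy scaffold of 10 bases spanned by four clones (hypothetical data).
cov = clone_coverage(10, [(0, 6), (2, 8), (4, 10), (5, 10)])
low = fraction_below(cov, threshold=3)
```

The same `fraction_below` call with `threshold=3` corresponds to the "less than 3X clone coverage" measure reported in the text.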

We examined the locations and number of
all misoriented and misseparated mates. In
addition to doing this analysis on the CSA
assembly (as of 1 October 2000), we also
performed a study of the PFP assembly as of
5 September 2000 (30, 55b). In this latter
case, Celera mate pairs had to be mapped to
the PFP assembly. To avoid mapping errors
due to high-fidelity repeats, the only pairs
mapped were those for which both reads
matched at only one location with less than
6% differences. A threshold was set such that
sets of five or more simultaneously invalid
mate pairs indicated a potential breakpoint,
where the construction of the two assemblies
differed. The graphic comparison of the CSA
chromosome 21 assembly with the published
sequence (Fig. 6A) serves as a validation of
this methodology. Blue tick marks in the
panels indicate breakpoints. There were a
similar (small) number of breakpoints on
both chromosome sequences. The exception
was 12 sets of scaffolds in the Celera assem-
bly (a total of 3% of the chromosome length
in 212 single-contig scaffolds) that were
mapped to the wrong positions because they
were too small to be mapped reliably. Figures
6 and 7 and Table 6 illustrate the mate-pair
differences and breakpoints between the two
assemblies. There was a higher percentage of
misoriented and misseparated mate pairs in
the large-insert libraries (50 kbp and BAC
ends) than in the small-insert libraries in both
assemblies (Table 6). The large-insert librar-
ies are more likely to identify discrepancies
simply because they span a larger segment of
the genome. The graphic comparison be-
tween the two assemblies for chromosome 8
(Fig. 6, B and C) shows that there are many



Table 5. Mate-pair validation. Celera fragment sequences were mapped to
the published sequence of chromosome 21. Each mate pair uniquely
mapped was evaluated for correct orientation and placement (number
of mate pairs tested). If the two mates had incorrect relative orienta-
tion or placement they were considered invalid (number of invalid mate
pairs).

                  Chromosome 21                                                        Genome
Library  Library  Mean insert  SD      SD/mean  No. of mate   No. of invalid  %        Mean insert  SD      SD/mean
type     no.      size (bp)    (bp)    (%)      pairs tested  mate pairs      invalid  size (bp)    (bp)    (%)
2 kbp    1        2,081        106     5.1      3,642         38              1.0      2,082        90      4.3
         2        1,913        152     7.9      28,029        413             1.5      1,923        118     6.1
         3        2,166        175     8.1      4,405         57              1.3      2,162        158     7.3
10 kbp   4        11,385       851     7.5      4,319         80              1.9      11,370       696     6.1
         5        14,523       1,875   12.9     7,355         156             2.1      14,142       1,402   9.9
         6        9,635        1,035   10.7     5,573         109             2.0      9,606        934     9.7
         7        10,223       928     9.1      34,079        399             1.2      10,190       777     7.6
50 kbp   8        64,888       2,747   4.2      16            1               6.3      65,500       5,504   8.4
         9        53,410       5,834   10.9     914           170             18.6     53,311       5,546   10.4
         10       52,034       7,312   14.1     5,871         569             9.7      51,498       6,588   12.8
         11       52,282       7,454   14.3     2,629         213             8.1      52,282       7,454   14.3
         12       46,616       7,378   15.8     2,153         215             10.0     45,418       9,068   20.0
         13       55,788       10,099  18.1     2,244         249             11.1     53,062       10,893  20.5
         14       39,894       5,019   12.6     199           7               3.5      36,838       9,988   27.1
         15       48,931       9,813   20.1     144           10              6.9      47,845       4,774   10.0
         16       48,130       4,232   8.8      195           14              7.2      47,924       4,581   9.6
BES      17       106,027      27,778  26.2     330           16              4.8      152,000      26,600  17.5
         18       160,575      54,973  34.2     155           8               5.2      161,750      27,000  16.7
         19       164,155      19,453  11.9     642           44              6.9      176,500      19,500  11.05
Sum                                             102,894       2,768           2.7
                                                                              (mean = 2.7)
gene boundaries. During this process, multiple
hits to the same region were collapsed to a
coherent set of data by tracking the coverage of
a region. For example, if a group of bases was
represented by multiple overlapping ESTs, the
union of these regions matched by the set of
ESTs on the scaffold was marked as being
supported by EST evidence. This resulted in a
series of "gene bins," each of which was be-
lieved to contain a single gene. One weakness of
this initial implementation of the algorithm was
in predicting gene boundaries in regions of tan-
demly duplicated genes. Gene clusters frequent-
ly resulted in homologous neighboring genes
being joined together, resulting in an annotation
that artificially concatenated these gene models.
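Collapsing overlapping hits into the union of the regions they cover is a standard interval-merging step. The Python sketch below shows one common way to do it; it is not taken from the Otto system, and the function name and the toy coordinates are hypothetical. It also makes the weakness noted above easy to see: any two neighboring genes whose evidence overlaps or abuts end up in the same merged region.

```python
def merge_hits(hits):
    """Collapse overlapping or abutting half-open (start, end) hits into their union."""
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1]:
            # Extends (or falls inside) the region currently being built.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Four EST hits on a scaffold (toy coordinates); three of them overlap.
regions = merge_hits([(100, 250), (200, 400), (900, 1000), (380, 500)])
```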

Next, known genes (those with exact match-
es of a full-length cDNA sequence to the ge-
nome) were identified, and the region corre-
sponding to the cDNA was annotated as a
predicted transcript. A subset of the curat-
ed human gene set RefSeq from the Nation-
al Center for Biotechnology Information
(NCBI) was included as a data set searched in
the computational pipeline. If a RefSeq tran-
script matched the genome assembly for at least
50% of its length at >92% identity, then the
SIM4 (63) alignment of the RefSeq transcript to
the region of the genome under analysis was
promoted to the status of an Otto annotation.
Because the genome sequence has gaps and
sequence errors such as frameshifts, it was not
always possible to predict a transcript that
agrees precisely with the experimentally deter-
mined cDNA sequence. A total of 6538 genes
in our inventory were identified and transcripts
predicted in this way.
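The promotion rule for RefSeq alignments stated above (at least 50% of the transcript length matched at more than 92% identity) can be written as a single predicate. The Python sketch below is a hedged illustration; the function name, argument names, and the representation of identity as a fraction are assumptions made for this example.

```python
def promote_refseq_alignment(aligned_length, transcript_length, identity,
                             min_fraction=0.50, min_identity=0.92):
    """True if the alignment covers at least `min_fraction` of the transcript
    at an identity strictly greater than `min_identity`."""
    fraction = aligned_length / transcript_length
    return fraction >= min_fraction and identity > min_identity


# A 2,000-base transcript with three hypothetical alignments.
decisions = [
    promote_refseq_alignment(1_200, 2_000, 0.97),  # 60% of length, 97% identity
    promote_refseq_alignment(800, 2_000, 0.99),    # only 40% of length
    promote_refseq_alignment(1_500, 2_000, 0.92),  # identity not above 92%
]
```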

Fig. 6. Comparison of the CSA and the PFP assembly.
(A) All of chromosome 21, (B) all of chromosome 8,
and (C) a 1-Mb region of chromosome 8 representing
a single Celera scaffold. To generate the figure, Celera
fragment sequences were mapped onto each assem-
bly. The PFP assembly is indicated in the upper third
of each panel; the Celera assembly is indicated in the
lower third. In the center of the panel, green lines
show Celera sequences that are in the same order and
orientation in both assemblies and form the longest
consistently ordered run of sequences. Yellow lines
indicate sequence blocks that are in the same orien-
tation, but out of order. Red lines indicate sequence
blocks that are not in the same orientation. For
clarity, in the latter two cases, lines are only drawn
between segments of matching sequence that are at
least 50 kbp long. The top and bottom thirds of each
panel show the extent of Celera mate-pair violations
(red, misoriented; yellow, incorrect distance between
the mates) for each assembly grouped by library size.
(Mate pairs that are within the correct distance, as
expected from the mean library insert size, are omit-
ted from the figure for clarity.) Predicted breakpoints,
corresponding to stacks of violated mate pairs of the
same type, are shown as blue ticks on each assembly
axis. Runs of more than 10,000 Ns are shown as cyan
bars. Plots of all 24 chromosomes can be seen in Web
fig. 3 on Science Online at www.sciencemag.org/cgi/
content/full/291/5507/1304/DC1.

Fig. 7. Schematic view of the distribution of breakpoints and large gaps
on all chromosomes. For each chromosome, the upper pair of lines
represent the PFP assembly, and the lower pair of lines represent Celera's
assembly. Blue tick marks represent breakpoints, whereas red tick marks
represent a gap of larger than 10,000 bp. The number of breakpoints per
chromosome is indicated in black, and the chromosome numbers in red.

Regions that have a substantial amount of
sequence similarity, but do not match known
genes, were analyzed by that part of the Otto
system that uses the sequence similarity in-
formation to predict a transcript. Here, Otto
evaluates evidence generated by the compu-
tational pipeline, corresponding to conserva-
tion between mouse and human genomic
DNA, similarity to human transcripts (ESTs
and cDNAs), similarity to rodent transcripts
(ESTs and cDNAs), and similarity of the
translation of human genomic DNA to known
proteins to predict potential genes in the hu-
man genome. The sequence from the region
of genomic DNA contained in a gene bin was
extracted, and the subsequences supported by
any homology evidence were marked (plus 100
bases flanking these regions). The other bases
in the region, those not covered by any homol-
ogy evidence, were replaced by N's. This se-
quence segment, with high-confidence regions
represented by the consensus genomic se-
quence and the remainder represented by N's,
was then evaluated by Genscan to see if a
consistent gene model could be generated. This
procedure simplified the gene-prediction task
by first establishing the boundary for the gene
(not a strength of most gene-finding algo-
rithms), and by eliminating regions with no
supporting evidence. If Genscan returned a
plausible gene model, it was further evaluated
before being promoted to an "Otto" annotation.
The final Genscan predictions were often quite
different from the prediction that Genscan re-
turned on the same region of native genomic
sequence. A weakness of using Genscan to
refine the gene model is the loss of valid, small
exons from the final annotation.
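The masking step described in this section (keep evidence-supported bases plus flanking sequence, replace everything else with N) can be sketched as follows. This is an illustration under stated assumptions rather than the Otto implementation: the function name is hypothetical, evidence is assumed to arrive as half-open (start, end) intervals, and the example uses a 2-base flank on a short toy sequence instead of the 100-base flank used in the text.

```python
def mask_unsupported(sequence, evidence, flank=100):
    """Replace bases outside the evidence regions (extended by `flank`) with 'N'.

    `evidence` holds half-open (start, end) intervals on `sequence`.
    """
    keep = [False] * len(sequence)
    for start, end in evidence:
        lo = max(0, start - flank)
        hi = min(len(sequence), end + flank)
        for i in range(lo, hi):
            keep[i] = True
    return "".join(base if kept else "N" for base, kept in zip(sequence, keep))


# Toy example: one evidence block at positions 5-6, flank of 2 bases.
masked = mask_unsupported("ACGTACGTACGT", [(5, 7)], flank=2)
```

The masked string is what would then be handed to a gene finder such as Genscan in place of the native genomic sequence.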

The next step in defining gene structures
based on sequence similarity was to compare
each predicted transcript with the homology-
based evidence that was used in previous steps
to evaluate the depth of evidence for each exon
in the prediction. Internal exons were consid-
ered to be supported if they were covered by
homology evidence to within ±10 bases of
their edges. For first and last exons, the internal
edge was required to be within 10 bases, but the
external edge was allowed greater latitude to
allow for 5' and 3' untranslated regions
(UTRs). To be retained, a prediction for a
multi-exon gene must have evidence such that
the total number of "hits," as defined above,
divided by the number of exons in the predic-
tion must be >0.66 or must correspond to a
RefSeq sequence. A single-exon gene must be
covered by at least three supporting hits (±10
bases on each side), and these must cover the
complete predicted open reading frame. For
a single-exon gene, we also required that
the Genscan prediction include both a start
and a stop codon. Gene models that did not
meet these criteria were disregarded, and
those that passed were promoted to Otto
predictions. Homology-based Otto predic-
tions do not contain 3' and 5' untranslated
sequence. Although three de novo gene-finding
programs [GRAIL, Genscan, and FgenesH
(63)] were run as part of the computational
analysis, the results of these programs were not
directly used in making the Otto predictions.
Otto predicted 11,226 additional genes by
means of sequence similarity.

Table 7. Sensitivity and specificity of Otto and
Genscan. Sensitivity and specificity were calculat-
ed by first aligning the prediction to the published
RefSeq transcript, tallying the number (N) of
uniquely aligned RefSeq bases. Sensitivity is the
ratio of N to the length of the published RefSeq
transcript. Specificity is the ratio of N to the
length of the prediction. All differences are signif-
icant (Tukey HSD; P < 0.001).

Method                 Sensitivity   Specificity
Otto (RefSeq only)*    0.939         0.973
Otto (homology)†       0.604         0.884
Genscan                0.501         0.633

*Refers to those annotations produced by Otto using only
the Sim4-polished RefSeq alignment rather than an evi-
dence-based Genscan prediction. †Refers to those
annotations produced by supplying all available evidence
to Genscan.
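The retention criteria for homology-based predictions described in Section 3.1 can be summarized as a small decision function. The Python sketch below is an interpretation, not the Otto code: the function and parameter names are hypothetical, the count of supporting "hits" is assumed to be computed elsewhere, and the RefSeq exemption is applied only to multi-exon predictions because that is where the text states it.

```python
def keep_prediction(n_exons, supported_hits, is_refseq=False,
                    orf_fully_covered=False, has_start_and_stop=False):
    """Decide whether a predicted transcript is retained.

    Multi-exon: hits per exon must exceed 0.66, or the prediction must
    correspond to a RefSeq sequence.
    Single-exon: at least three supporting hits covering the complete
    open reading frame, plus both a start and a stop codon.
    """
    if n_exons > 1:
        return is_refseq or supported_hits / n_exons > 0.66
    return supported_hits >= 3 and orf_fully_covered and has_start_and_stop


# Hypothetical predictions.
results = [
    keep_prediction(n_exons=6, supported_hits=5),                  # 0.83 hits per exon
    keep_prediction(n_exons=6, supported_hits=3),                  # 0.50 hits per exon
    keep_prediction(n_exons=6, supported_hits=3, is_refseq=True),  # RefSeq exemption
    keep_prediction(n_exons=1, supported_hits=3,
                    orf_fully_covered=True, has_start_and_stop=True),
    keep_prediction(n_exons=1, supported_hits=2,
                    orf_fully_covered=True, has_start_and_stop=True),
]
```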

3.2 Otto validation 

To validate the Otto homology-based process
and the method that Otto uses to define the
structures of known genes, we compared tran-
scripts predicted by Otto with their correspond-
ing (and presumably correct) transcript from a
set of 4512 RefSeq transcripts for which there
was a unique SIM4 alignment (Table 7). In
order to evaluate the relative performance of
Otto and Genscan, we made three comparisons.
The first involved a determination of the accu-
racy of gene models predicted by Otto with
only homology data other than the correspond-
ing RefSeq sequence (Otto homology in Table
7). We measured the sensitivity (correctly pre-
dicted bases divided by the total length of the
cDNA) and specificity (correctly predicted
bases divided by the sum of the correctly and
incorrectly predicted bases). Second, we exam-
ined the sensitivity and specificity of the Otto
predictions that were made solely with the Ref-
Seq sequence, which is the process that Otto
uses to annotate known genes (Otto-RefSeq).
And third, we determined the accuracy of the
Genscan predictions corresponding to these
RefSeq sequences. As expected, the alignment
method (Otto-RefSeq) was the most accurate,
and Otto-homology performed better than Gen-
scan by both criteria. Thus, 6.1% of true RefSeq
nucleotides were not represented in the Otto-
RefSeq annotations and 2.7% of the nucleotides
in the Otto-RefSeq transcripts were not con-
tained in the original RefSeq transcripts. The
discrepancies could come from legitimate
differences between the Celera assembly
and the RefSeq transcript due to polymor-
phisms, incomplete or incorrect data in the
Celera assembly, errors introduced by Sim4
during the alignment process, or the pres-
ence of alternatively spliced forms in the
data set used for the comparisons.
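The two accuracy measures used in this comparison are simple ratios of the number of uniquely aligned RefSeq bases (N in Table 7). The Python sketch below restates them; the function name and the toy transcript lengths are hypothetical, while the 0.939 and 0.973 values are the Otto (RefSeq only) entries from Table 7.

```python
def sensitivity_specificity(n_aligned, refseq_length, prediction_length):
    """Sensitivity = N / RefSeq transcript length;
    specificity = N / predicted transcript length."""
    return n_aligned / refseq_length, n_aligned / prediction_length


# Toy transcript: 1,800 aligned bases, 2,000-base RefSeq, 1,900-base prediction.
sens, spec = sensitivity_specificity(1_800, 2_000, 1_900)

# The percentages quoted in the text are the complements of the
# Otto (RefSeq only) values in Table 7.
missed_refseq_pct = round((1 - 0.939) * 100, 1)   # RefSeq bases not in the annotation
extra_predicted_pct = round((1 - 0.973) * 100, 1)  # predicted bases not in RefSeq
```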

Because Otto uses an evidence-based ap-
proach to reconstruct genes, the absence of
experimental evidence for intervening exons
may inadvertently result in a set of exons that
cannot be spliced together to give rise to a
transcript. In such cases, Otto may "split genes"
when in fact all the evidence should be com-
bined into a single transcript. We also examined
the tendency of these methods to incorrectly
split gene predictions. These trends are shown
in Fig. 8. Both RefSeq and homology-based
predictions by Otto split known genes into few-
er segments than Genscan alone.



3.3 Gene number 

Recognizing that the Otto system is quite
conservative, we used a different gene-pre-
diction strategy in regions where the ho-
mology evidence was less strong. Here the
results of de novo gene predictions were
used. For these genes, we insisted that a
predicted transcript have at least two of the
following types of evidence to be included
in the gene set for further analysis: protein,
human EST, rodent EST, or mouse genome
fragment matches. This final class of pre-
dicted genes is a subset of the predictions
made by the three gene-finding programs
that were used in the computational pipe-
line. For these, there was not sufficient
sequence similarity information for Otto to
attempt to predict a gene structure. The
three de novo gene-finding programs re-
sulted in about 155,695 predictions, of
which ~76,410 were nonredundant (non-
overlapping with one another). Of these,
57,935 did not overlap known genes or
predictions made by Otto. Only 21,350 of
the gene predictions that did not overlap
Otto predictions were partially supported
by at least one type of sequence similarity
evidence, and 8619 were partially support-
ed by two types of evidence (Table 8).
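The evidence requirement for de novo predictions amounts to counting how many of the four evidence categories support a transcript. The Python sketch below is a minimal illustration; the category labels, function name, and example predictions are hypothetical stand-ins for the four types named in the text.

```python
# The four evidence categories named in the text (labels are illustrative).
EVIDENCE_TYPES = {"protein", "human_est", "rodent_est", "mouse_genome"}


def passes_evidence_filter(evidence, min_types=2):
    """True if the prediction is supported by at least `min_types`
    distinct categories of sequence similarity evidence."""
    return len(set(evidence) & EVIDENCE_TYPES) >= min_types


# Hypothetical predictions and the evidence found for each.
predictions = {
    "pred_a": ["protein", "human_est"],
    "pred_b": ["human_est", "human_est"],   # one category seen twice
    "pred_c": ["rodent_est", "mouse_genome", "protein"],
    "pred_d": [],
}
retained = sorted(name for name, ev in predictions.items()
                  if passes_evidence_filter(ev))
```

Raising `min_types` reproduces the more stringent tiers discussed in the text.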

The sum of this number (21,350) and the
number of Otto annotations (17,764), 39,114,
is near the upper limit for the human gene
complement. As seen in Table 8, if the re-
quirement for other supporting evidence is
made more stringent, this number drops rap-
idly so that demanding two types of evidence
reduces the total gene number to 26,383 and
demanding three types reduces it to ~23,000.
Requiring that a prediction be supported by
all four categories of evidence is too stringent
because it would eliminate genes that encode
novel proteins (members of currently unde-
scribed protein families). No correction for
pseudogenes has been made at this point in
the analysis.

In a further attempt to identify genes that
were not found by the autoannotation process
or any of the de novo gene finders, we examined
regions outside of gene predictions
that were similar to the EST sequence, and
where the EST matched the genomic sequence
across a splice junction. After correcting
for potential 3' UTRs of predicted genes,
about 2500 such regions remained. Addition
of a requirement for at least one of the following
evidence types (homology to mouse
genomic sequence fragments, rodent ESTs,
or cDNAs, or similarity to a known protein)
reduced this number to 1010. Adding this to
the numbers from the previous paragraph
would give us estimates of about 40,000,
27,000, and 24,000 potential genes in the
human genome, depending on the stringency
of evidence considered. Table 8 illustrates the
number of genes and presents the degree of
confidence based on the supporting evidence.
Transcripts encoded by a set of 26,383 genes
were assembled for further analysis. This set
includes the 6538 genes predicted by Otto on
the basis of matches to known genes, 11,226
transcripts predicted by Otto based on homology
evidence, and 8619 from the subset of
transcripts from de novo gene-prediction programs
that have two types of supporting evidence.
The 26,383 genes are illustrated along
chromosome diagrams in Fig. 1. These are a
very preliminary set of annotations and are
subject to all the limitations of an automated 
process. Considerable refinement is still nec- 
essary to improve the accuracy of these tran- 
script predictions. All the predictions and 
descriptions of genes and the associated evi- 
dence that we present are the product of 
completely computational processes, not ex- 
pert curation. We have attempted to enumer- 
ate the genes in the human genome in such a 
way that we have different levels of confidence
based on the amount of supporting
evidence: known genes, genes with good protein
or EST homology evidence, and de novo
gene predictions confirmed by modest homology
evidence.
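
The gene-count estimates quoted above follow from simple addition of the figures given in the text and in Table 8. The short sketch below only re-derives those totals; the constant and function names are illustrative and are not part of the original analysis.

```python
# Counts quoted in the text and in Table 8 (nothing here is new data).
OTTO_KNOWN_GENES = 6538      # Otto genes based on matches to known genes
OTTO_HOMOLOGY = 11226        # Otto transcripts based on homology evidence
DE_NOVO_BY_LINES = {1: 21350, 2: 8619, 3: 4947}  # de novo, >= n lines of evidence
SPLICED_EST_REGIONS = 1010   # extra regions supported by spliced ESTs


def gene_estimates():
    """Return the Otto total and, per stringency, (subtotal, subtotal + EST regions)."""
    otto_total = OTTO_KNOWN_GENES + OTTO_HOMOLOGY
    estimates = {}
    for lines, de_novo in DE_NOVO_BY_LINES.items():
        subtotal = otto_total + de_novo
        estimates[lines] = (subtotal, subtotal + SPLICED_EST_REGIONS)
    return otto_total, estimates


if __name__ == "__main__":
    otto_total, estimates = gene_estimates()
    print("Otto annotations:", otto_total)
    for lines, (subtotal, with_est) in estimates.items():
        print(f">= {lines} line(s) of evidence: {subtotal} genes, "
              f"{with_est} including spliced-EST regions")
```

The subtotals are 39,114, 26,383, and 22,711; adding the 1010 spliced-EST regions gives 40,124, 27,393, and 23,721, which round to the estimates of about 40,000, 27,000, and 24,000 stated in the text.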

3.4 Features of human gene
transcripts

We estimate the average span for a "typical"
gene in the human DNA sequence to
be about 27,894 bases. This is based on the
average span covered by RefSeq transcripts,
used because they represent our highest-confidence
set.

The set of transcripts promoted to gene
annotations varies in a number of ways. As
can be seen from Table 8 and Fig. 9, tran- 
scripts predicted by Otto tend to be longer, 
having on average about 7.8 exons, whereas 
those promoted from gene-prediction pro- 
grams average about 3.7 exons. The largest 
number of exons that we have identified in a 
transcript is 234 in the titin mRNA. Table 8 
compares the amounts of evidence that support
the Otto and other predicted transcripts.
For example, one can see that a typical Otto
transcript has 6.99 of its 7.81 exons supported
by protein homology evidence. As would be
expected, the Otto transcripts generally have
more support than do transcripts predicted by
the de novo methods.

4 Genome Structure 

Summary. This section describes several of
the noncoding attributes of the assembled
genome sequence and their correlations with 
the predicted gene set. These include an anal- 
ysis of G+C content and gene density in the 
context of cytogenetic maps of the genome, 
an enumerative analysis of CpG islands, and 
a brief description of the genome-wide repet- 
itive elements. 
4.1 Cytogenetic maps

Perhaps the most obvious, and certainly the
most visible, element of the structure of
the genome is the banding pattern produced
by Giemsa stain. Chromosomal banding
studies have revealed that about 17% to
20% of the human chromosome complement
consists of C-bands, or constitutive
heterochromatin (64). Much of this heterochromatin
is highly polymorphic and consists
of different families of alpha satellite
DNAs with various higher order repeat
structures (65). Many chromosomes have
complex inter- and intrachromosomal duplications
present in pericentromeric regions
(66). About 5% of the sequence reads
were identified as alpha satellite sequences;
these were not included in the assembly.



[Figure 8: bar chart. Horizontal axis: number of predictions per RefSeq transcript. Series: Otto (homology), Otto (RefSeq only), Genscan.]

Fig. 8. Analysis of split genes resulting from different annotation methods. A set of 4512
Sim4-based alignments of RefSeq transcripts to the genomic assembly were chosen (see the text
for criteria), and the numbers of overlapping Genscan, Otto (RefSeq only) annotations based solely
on Sim4-polished RefSeq alignments, and Otto (homology) annotations (annotations produced by
supplying all available evidence to Genscan) were tallied. These data show the degree to which
multiple Genscan predictions and/or Otto annotations were associated with a single RefSeq
transcript. The zero class for the Otto-homology predictions shown here indicates that the
Otto-homology calls were made without recourse to the RefSeq transcript, and thus no Otto call
was made because of insufficient evidence.



Table 8. Numbers of exons and transcripts supported by various types of evidence for Otto and de novo gene prediction methods. Highlighted cells indicate
the gene sets analyzed in this paper (boldface, set of genes selected for protein analysis; italic, total set of accepted de novo predictions).

                                            Types of evidence                       No. of lines of evidence*
                                 Total      Mouse     Rodent    Protein   Human     ≥1        ≥2        ≥3        ≥4
Otto     Number of transcripts   17,959     17,065    14,881    15,477    16,374    17,968†   17,501    15,877    12,451
         Number of exons         141,218    111,174   89,569    108,431   118,869   140,710   127,955   99,574    59,804
De novo  Number of transcripts   58,032     14,463    5,094     8,043     9,220     21,350    8,619     4,947     1,904
         Number of exons         319,935    48,594    19,344    26,264    40,104    79,148    31,130    17,508    6,520
No. of exons per transcript
         Otto                    7.84       5.77      6.01      6.99      7.24      7.81      7.19      6.00      4.28
         De novo                 5.53       3.17      3.80      3.27      4.36      3.7       3.56      3.42      3.16

*Four kinds of evidence (conservation in mouse genomic DNA, similarity to human EST or cDNA, similarity to rodent EST or cDNA, and similarity to known proteins) were
considered to support gene predictions from the different methods. The use of evidence is quite liberal, requiring only a partial match to a single exon of the predicted transcript. †This
number includes alternative splice forms of the 17,764 genes mentioned elsewhere in the text.
Examination of pericentromeric regions is
ongoing.

The remaining ~80% of the genome, the
euchromatic component, is divisible into G-,
R-, and T-bands (67). These cytogenetic bands
have been presumed to differ in their nucleotide 
composition and gene density, although we 
have been unable to determine precise band 
boundaries at the molecular level. T-bands are
the most G+C- and gene-rich, and G-bands are
G+C-poor (68). Bernardi has also offered a
description of the euchromatin at the molecular
level as long stretches of DNA of differing base
composition, termed isochores (denoted L, H1,
H2, and H3), which are >300 kbp in length
(69). Bernardi defined the L (light) isochores as
G+C-poor (<43%), whereas the H (heavy)
isochores fall into three G+C-rich classes representing
24, 8, and 5% of the genome. Gene
concentration has been claimed to be very low
in the L isochores and 20-fold more enriched in
the H2 and H3 isochores (70). By examining
contiguous 50-kbp windows of G+C content
across the assembly, we found that regions of
G+C content >48% (H3 isochores) averaged
273.9 kbp in length, those with G+C content
between 43 and 48% (H1+H2 isochores) averaged
202.8 kbp in length, and the average span
of regions with <43% (L isochores) was
1078.6 kbp. The correlation between G+C
content and gene density was also examined in
50-kbp windows along the assembled sequence
(Table 9 and Figs. 10 and 11). We found that
the density of genes was greater in regions of
high G+C than in regions of low G+C content, 
as expected. However, the correlation between 
G+C content and gene density was not as 
skewed as previously predicted (69). A higher 
proportion of genes were located in the G+C- 
poor regions than had been expected. 
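
The windowed G+C analysis described above can be illustrated with a minimal sketch. It assumes contiguous, non-overlapping windows, ignores undetermined bases when computing G+C, and treats a window at exactly 43% as H1+H2; none of these details are stated in the text, and the function names are hypothetical.

```python
from itertools import groupby


def gc_fraction(seq):
    """Fraction of G+C among unambiguous bases; None if the window has none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else None


def isochore_class(gc):
    """Map a G+C fraction to the classes used in the text."""
    if gc > 0.48:
        return "H3"
    if gc >= 0.43:
        return "H1+H2"
    return "L"


def window_classes(seq, window=50_000):
    """Classify contiguous windows; a window with no called bases gets None."""
    classes = []
    for start in range(0, len(seq) - window + 1, window):
        gc = gc_fraction(seq[start:start + window])
        classes.append(isochore_class(gc) if gc is not None else None)
    return classes


def mean_run_length(classes, window=50_000):
    """Average length in bp of runs of consecutive windows sharing a class."""
    runs = {}
    for cls, group in groupby(classes):
        if cls is not None:
            runs.setdefault(cls, []).append(sum(1 for _ in group) * window)
    return {cls: sum(lengths) / len(lengths) for cls, lengths in runs.items()}
```

Applied to a chromosome sequence with the default 50-kbp window, `mean_run_length(window_classes(seq))` yields per-class averages of the kind reported above (for example, 273.9 kbp for H3), although the exact numbers depend on the assumptions listed.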

Chromosomes 17, 19, and 22, which have 
a disproportionate number of H3 -containing 
bands, had the highest gene density (Table 
10). Conversely, the chromosomes that we
found to have the lowest gene density (X, 4,
18, 13, and Y) also have the fewest H3 bands.
Chromosome 15, which also has few H3
bands, did not have a particularly low gene
density in our analysis. In addition, chromosome
8, which we found to have a low gene
density, does not appear to be unusual in its
H3 banding.

How valid is Ohno's postulate (71) that
mammalian genomes consist of oases of genes
in otherwise essentially empty deserts? It appears
that the human genome does indeed contain
deserts, or large, gene-poor regions. If we
define a desert as a region >500 kbp without a
gene, then we see that 605 Mbp, or about 20%
of the genome, is in deserts. These are not
uniformly distributed over the various chromosomes.
Gene-rich chromosomes 17, 19, and 22
have only about 12% of their collective 171
Mbp in deserts, whereas gene-poor chromosomes
4, 13, 18, and X have 27.5% of their 492
Mbp in deserts (Table 11). The apparent lack of
predicted genes in these regions does not necessarily
imply that they are devoid of biological
function.
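
The desert definition above (a region >500 kbp without a gene) lends itself to a short interval scan. The sketch below is one way to apply it, assuming genes are given as half-open (start, end) coordinates on a single chromosome and that gene-free stretches at the chromosome ends also count; both are assumptions, and the function names are hypothetical.

```python
def find_deserts(genes, chrom_length, min_size=500_000):
    """Return gene-free intervals longer than min_size as (start, end) pairs."""
    deserts = []
    cursor = 0  # end of the right-most gene seen so far
    for start, end in sorted(genes):
        if start - cursor > min_size:
            deserts.append((cursor, start))
        cursor = max(cursor, end)  # tolerate overlapping or nested genes
    if chrom_length - cursor > min_size:
        deserts.append((cursor, chrom_length))
    return deserts


def desert_fraction(genes, chrom_length, min_size=500_000):
    """Fraction of the chromosome that lies in deserts."""
    total = sum(end - start for start, end in find_deserts(genes, chrom_length, min_size))
    return total / chrom_length


if __name__ == "__main__":
    toy_genes = [(100_000, 120_000), (700_000, 710_000), (800_000, 900_000)]
    print(find_deserts(toy_genes, 2_000_000))
    print(desert_fraction(toy_genes, 2_000_000))
```

Summing the desert lengths over all chromosomes and dividing by the genome size gives the kind of figure quoted above (605 Mbp, about 20% of the genome).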

4.2 Linkage map 

Linkage maps provide the basis for genetic
analysis and are widely used in the study of the
inheritance of traits and in the positional cloning
of genes. The distance metric, centimorgans
(cM), is based on the recombination rate between
homologous chromosomes during meiosis.
In general, the rate of recombination in
females is greater than that in males, and this
degree of map expansion is not uniform across
the genome (72). One of the opportunities enabled
by a nearly complete genome sequence is
to produce the ultimate physical map, and to
fully analyze its correspondence with two other
maps that have been widely used in genome
and genetic analysis: the linkage map and the
cytogenetic map. This would close the loop
between the mapping and sequencing phases of
the genome project.

Table 9. Characteristics of G+C in isochores.

Isochore   G+C (%)   Fraction of genome         Fraction of genes
                     Predicted*   Observed      Predicted*   Observed
H3         >48       5            9.5           37           24.8
H1/H2      43-48     25           21.2          32           26.6
L          <43       67           69.2          31           48.5

*The predictions were based on Bernardi's definitions (70) of the isochore structure of the human genome.

[Figure 9: bar chart. Horizontal axis: number of exons per transcript (1 to >20). Series: No. of Otto transcripts; No. of de novo + 1 line of evidence.]

Fig. 9. Comparison of the number of exons per transcript between the 17,968 Otto transcripts
and 21,350 de novo transcript predictions with at least one line of evidence that do not overlap
with an Otto prediction. Both sets have the highest number of transcripts in the two-exon
category, but the de novo gene predictions are skewed much more toward smaller transcripts.
In the Otto set, 19.7% of the transcripts have one or two exons, and 5.7% have more than 20.
In the de novo set, 49.3% of the transcripts have one or two exons, and 0.2% have more than 20.

We mapped the location of the markers
that constitute the Genethon linkage map to
the genome. The rate of recombination, expressed
as cM per Mbp, was calculated for
3-Mbp windows as shown in Table 12. Higher
rates of recombination in the telomeric
region of the chromosomes have been previously
documented (73). From this mapping
result, there is a difference of 4.99 between
the lowest and highest rates, and a largest
difference of 4.4 between males and females
(4.99 to 0.47 on chromosome 16). This indicates
that the variability in recombination
rates among regions of the genome exceeds
the differences in recombination rates between
males and females. The human genome
has recombination hotspots, where recombination
rates vary fivefold or more over
a space of 1 kbp, so the picture one gets of the
magnitude of variability in recombination
rate will depend on the size of the window
examined. Unfortunately, too few meiotic
crossovers have occurred in Centre d'Etude
du Polymorphisme Humain (CEPH) and other
reference families to provide a resolution any
finer than about 3 Mbp. The next challenge
will be to determine a sequence basis of
recombination at the chromosomal level. An
accurate predictor of the variation in
recombination rates between any pair of
markers would be extremely useful in designing
markers to narrow a region of linkage,
such as in positional cloning projects.
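
The text does not spell out how the rate in each 3-Mbp window was computed. One plausible reading, sketched below purely as an assumption, is to interpolate the genetic position (cM) linearly between mapped markers and divide the genetic distance spanned by each window by its physical size in Mbp. The marker format and function names are hypothetical, and marker positions are assumed to be distinct.

```python
from bisect import bisect_right


def interpolate_cm(markers, pos):
    """Genetic position (cM) at physical position pos (bp), by linear interpolation.

    markers is a list of (bp, cM) pairs sorted by bp; positions outside the
    mapped range are clamped to the first or last marker.
    """
    positions = [bp for bp, _ in markers]
    if pos <= positions[0]:
        return markers[0][1]
    if pos >= positions[-1]:
        return markers[-1][1]
    i = bisect_right(positions, pos)
    (x0, y0), (x1, y1) = markers[i - 1], markers[i]
    return y0 + (y1 - y0) * (pos - x0) / (x1 - x0)


def windowed_rates(markers, window_bp=3_000_000):
    """Return (window_start_bp, cM per Mbp) for consecutive full windows."""
    markers = sorted(markers)
    pos, last = markers[0][0], markers[-1][0]
    rates = []
    while pos + window_bp <= last:
        span_cm = interpolate_cm(markers, pos + window_bp) - interpolate_cm(markers, pos)
        rates.append((pos, span_cm / (window_bp / 1_000_000)))
        pos += window_bp
    return rates
```

Running the same function separately on male, female, and sex-averaged maps, and taking the maximum, mean, and minimum per chromosome, would give values laid out like those in Table 12.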

4.3 Correlation between CpG islands 
and genes 

CpG islands are stretches of unmethylated 
DNA with a higher frequency of CpG 
dinucleotides when compared with the entire 
genome (74). CpG islands are believed to
preferentially occur at the transcriptional start
of genes, and it has been observed that most
housekeeping genes have CpG islands at the
5' end of the transcript (75, 76). In addition,
experimental evidence indicates that CpG island
methylation is correlated with gene inactivation
(77) and has been shown to be
important during gene imprinting (78) and
tissue-specific gene expression (79).

Experimental methods have been used
that resulted in an estimate of 30,000 to
45,000 CpG islands in the human genome
(74, 80) and an estimate of 499 CpG islands
on human chromosome 22 (81). Larsen et
al. (76) and Gardiner-Garden and Frommer
(75) used a computational method to identify
CpG islands and defined them as regions
of DNA of >200 bp that have a G+C
content of >50% and a ratio of observed
versus expected frequency of CG dinucleotide
≥0.6.

It is difficult to make a direct compari- 
son of experimental definitions of CpG is- 
lands with computational definitions be- 
cause computational methods do not con- 
sider the methylation state of cytosine and 
experimental methods do not directly select 
regions of high G+C content. However, we 
can determine the correlation of CpG islands
with gene starts, given a set of annotated
genomic transcripts and the whole genome
sequence. We have analyzed the publicly
available annotation of chromosome 22, as
well as using the entire human genome in
our assembly and the computationally annotated
genes. A variation of the CpG island
computation was compared with
Larsen et al. (76). The main differences are
that we use a sliding window of 200 bp, 
consecutive windows are merged only if 
they overlap, and we recompute the CpG 
value upon merging, thus rejecting any po- 
tential island if it scores less than the 
threshold. 
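
The sliding-window variant described above can be sketched as follows. The observed/expected ratio uses the standard Gardiner-Garden and Frommer form, (number of CpG x length) / (number of C x number of G). The step size of 1 bp, the half-open coordinates, and the function names are assumptions made for illustration, not details taken from the text.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    ratio = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, ratio


def passes(seq, min_gc=0.5, min_ratio=0.6):
    """True if the sequence meets the G+C and CpG ratio thresholds."""
    gc_fraction, ratio = cpg_stats(seq)
    return gc_fraction > min_gc and ratio >= min_ratio


def find_cpg_islands(seq, window=200, step=1, min_gc=0.5, min_ratio=0.6):
    """Slide a window, merge overlapping passing windows, then re-score each merge."""
    candidates = []
    current = None  # [start, end) of the run of overlapping passing windows
    for start in range(0, len(seq) - window + 1, step):
        end = start + window
        if not passes(seq[start:end], min_gc, min_ratio):
            continue
        if current is not None and start < current[1]:
            current[1] = end          # overlaps the current run: extend it
        else:
            if current is not None:
                candidates.append(tuple(current))
            current = [start, end]
    if current is not None:
        candidates.append(tuple(current))
    # Recompute the statistics on each merged region and reject failures.
    return [(s, e) for s, e in candidates if passes(seq[s:e], min_gc, min_ratio)]
```

Calling `find_cpg_islands(seq, min_ratio=0.8)` corresponds to the stricter threshold used as method 2 in the text. The scan is quadratic in the window size and is meant only to show the logic, not to process a whole genome efficiently.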

To compute various CpG statistics, we 
used two different thresholds of CG dinucle- 
otide likelihood ratio. Besides using the orig- 
inal threshold of 0.6 (method 1), we used a 
higher threshold of CG dinucleotide likeli- 
hood ratio of 0.8 (method 2), which results in 
the number of CpG islands on chromosome 
22 close to the number of annotated genes on 
this chromosome. The main results are summarized
in Table 13. CpG islands computed
with method 1 predicted only 2.6% of the
CSA sequence as CpG, but 40% of the gene
starts (start codons) are contained inside a
CpG island. This is comparable to ratios reported
by others (82). The last two rows of
the table show the observed and expected
average distance, respectively, of the closest
CpG island from the first exon. The observed
average distances to the closest CpG island
are smaller than the corresponding expected
distances, confirming an association between
CpG islands and the first exon.

[Figure 10: bar chart. Horizontal axis: % G+C, in bins from 30-35% to 60-65%. Series: % of genome, % of genes.]

Fig. 10. Relation between G+C content and gene density. The blue bars show the percent of the
genome (in 50-kbp windows) with the indicated G+C content. The percent of the total number of
genes associated with each G+C bin is represented by the yellow bars. The graph shows that about
5% of the genome has a G+C content of between 50 and 55%, but that this portion contains
nearly 15% of the genes.

We also looked at the distribution of CpG
island nucleotides among various sequence
classes such as intergenic regions, introns,
exons, and first exons. We computed the
likelihood score for each sequence class as
the ratio of the observed fraction of CpG
island nucleotides in that sequence class
and the expected fraction of CpG island
nucleotides in that sequence class. The results
of applying method 1 on CSA were
scores of 0.89 for intergenic region, 1.2 for
intron, 5.86 for exon, and 13.2 for first
exon. The same trend was also found for
chromosome 22 and after the application of
a higher threshold (method 2) on both data
sets. In sum, genome-wide analysis has
extended earlier analysis and suggests a
strong correlation between CpG islands and
first coding exons.
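
The likelihood score defined above is a ratio of two fractions. The sketch below shows one way to compute it, assuming that the expected fraction for a class is simply that class's share of all nucleotides (that is, CpG-island bases spread uniformly) and that the classes passed in are disjoint. The text does not state how the expectation was derived, so this is an illustrative reading with hypothetical names and toy numbers.

```python
def likelihood_scores(class_lengths, island_bases):
    """Observed / expected fraction of CpG-island nucleotides per sequence class.

    class_lengths: total nucleotides in each class.
    island_bases:  CpG-island nucleotides falling in each class.
    """
    total_length = sum(class_lengths.values())
    total_island = sum(island_bases.values())
    scores = {}
    for name, length in class_lengths.items():
        observed = island_bases.get(name, 0) / total_island
        expected = length / total_length
        scores[name] = observed / expected
    return scores


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not taken from the paper.
    lengths = {"intergenic": 700, "intron": 250, "exon": 50}
    islands = {"intergenic": 35, "intron": 15, "exon": 50}
    for name, score in likelihood_scores(lengths, islands).items():
        print(name, round(score, 2))
```

A score above 1 means the class holds more CpG-island sequence than its size alone would predict, which is the pattern the text reports for exons and, most strongly, first exons.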

4.4 Genome-wide repetitive elements 

The proportion of the genome covered by
various classes of repetitive DNA is presented
in Table 14. We observed about 35% of
the genome in these repeat classes, very sim- 
ilar to values reported previously (83). Repet- 
itive sequence may be underrepresented in 
the Celera assembly as a result of incomplete 
repeat resolution, as discussed above. About 
8% of the scaffold length is in gaps, and we 
expect that much of this is repetitive se- 
quence. Chromosome 19 has the highest re- 
peat density (57%), as well as the highest 
gene density (Table 10). Of interest, among 
the different classes of repeat elements, we 
observe a clear association of Alu elements 
and gene density, which was not observed 
between LINEs and gene density. 

5 Genome Evolution 

Summary. The dynamic nature of genome 
evolution can be captured at several levels. 
These include gene duplications mediated by 
RNA intermediates (retrotransposition) and 
segmental genomic duplications. In this sec- 
tion, we document the genome-wide occur- 
rence of retrotransposition events generating 
functional (intronless paralogs) or inactive 
genes (pseudogenes). Genes involved in
translational processes and nuclear regulation 
account for nearly 50% of all intronless para- 
logs and processed pseudogenes detected in 
our survey. We have also cataloged the extent 
of segmental genomic duplication and pro- 
vide evidence for 1077 duplicated blocks 
covering 3522 distinct genes. 
Fig. 11. Genome structural features.
Fig. 11 (continued). Relation among gene density (orange), G+C content
(green), EST density (blue), and Alu density (pink) along the lengths of
each of the chromosomes. Gene density was calculated in 1-Mbp windows.
The percent of G+C nucleotides was calculated in 100-
windows. The number of ESTs and Alu elements is shown per 100-
window.



5.1 Retrotransposition in the human 
genome 

Retrotransposition of processed mRNA
transcripts into the genome results in functional
genes, called intronless paralogs, or
inactivated genes (pseudogenes). A paralog
refers to a gene that appears in more than
one copy in a given organism as a result of
a duplication event. The existence of both
intron-containing and intronless forms of
genes encoding functionally similar or
identical proteins has been previously described
(84, 85). Cataloging these evolutionary
events on the genomic landscape is
of value in understanding the functional
consequences of such gene-duplication
events in cellular biology. Identification of
conserved intronless paralogs in the mouse
or other mammalian genomes should provide
the basis for capturing the evolutionary
chronology of these transposition
events and provide insights into gene loss
and accretion in the mammalian radiation.
A set of proteins corresponding to all 901
Otto-predicted, single-exon genes was subjected
to BLAST analysis against the proteins
encoded by the remaining multiexon predicted
transcripts. Using homology criteria of
70% sequence identity over 90% of the
length, we identified 298 instances of single-
to multi-exon correspondence. Of these 298
sequences, 97 were represented in the GenBank
data set of experimentally validated
full-length genes at the stringency specified
and were verified by manual inspection.
We believe that these 97 cases may represent
intronless paralogs (see Web table 1 on
Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1)
of known
genes. Most of these are flanked by direct 
repeat sequences, although the precise nature 
of these repeats remains to be determined. All 
of the cases for which we have high confi- 
dence contain polyadenylated [poly(A)] tails 
characteristic of retrotransposition. 
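
The homology criterion used above (70% sequence identity over 90% of the length) amounts to a two-part filter on each BLAST hit. The sketch below assumes that "length" means the length of the single-exon query protein and that coverage can be approximated by alignment length divided by query length; the text does not specify either point, and the `Hit` record is a hypothetical container rather than a real BLAST output format.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Minimal, hypothetical summary of one pairwise protein alignment."""
    query_id: str            # single-exon predicted protein
    subject_id: str          # multiexon predicted protein
    percent_identity: float  # 0 to 100
    alignment_length: int    # aligned residues
    query_length: int        # full length of the query protein


def is_candidate(hit, min_identity=70.0, min_coverage=0.90):
    """True if the hit meets both the identity and the coverage threshold."""
    coverage = hit.alignment_length / hit.query_length
    return hit.percent_identity >= min_identity and coverage >= min_coverage


def candidate_pairs(hits, min_identity=70.0, min_coverage=0.90):
    """Return (query, subject) pairs that pass the filter."""
    return [(h.query_id, h.subject_id) for h in hits
            if is_candidate(h, min_identity, min_coverage)]
```

Lowering `min_coverage` to 0.70 gives the looser form of the same filter that the text later applies when searching transcripts against genomic sequence for processed pseudogenes.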

Recent publications describing the phenomenon
of functional intronless paralogs
speculate that retrotransposition may serve as
a mechanism used to escape X-chromosomal
inactivation (84, 86). We do not find a bias
toward X chromosome origination of these
retrotransposed genes; rather, the results
show a random chromosome distribution of
both the intron-containing and corresponding
intronless paralogs. We also have found several
cases of retrotransposition from a single
source chromosome to multiple target chromosomes.
Interesting examples include the
retrotransposition of a five-exon
ribosomal protein L21 gene on chromosome
13 onto chromosomes 1, 3, 4, 7, 10, and 14.
The size of the source genes can
also show variability. The largest example is
the 31-exon diacylglycerol kinase zeta gene
on chromosome 11 that has an intronless
paralog on chromosome 13. Regardless of
route, retrotransposition with subsequent
gene changes in coding or noncoding regions
that lead to different functions or expression
patterns represents a key route to providing
an enhanced functional repertoire in mammals
(87).

Our preliminary set of retrotransposed intronless
paralogs contains a clear overrepresentation
of genes involved in translational
processes (40% ribosomal proteins and 10% 
translation elongation factors) and nuclear 
regulation (HMG nonhistone proteins, 4%), 
as well as metabolic and regulatory enzymes. 
EST matches specific to a subset of intronless 
paralogs suggest expression of these intron- 
less paralogs. Differences in the upstream 
regulatory sequences between the source 
genes and their intronless paralogs could account
for differences in tissue-specific gene
expression. Defining which, if any, of these 
processed genes are functionally expressed 
and translated will require further elucidation 
and experimental validation. 
5.2 Pseudogenes

A pseudogene is a nonfunctional copy that is
very similar to a normal gene but that has
been altered slightly so that it is not expressed.
We developed a method for the preliminary
analysis of processed pseudogenes
in the human genome as a starting point in
elucidating the ongoing evolutionary forces

Table 11. Genome overview.

Size of the genome (including gaps)                                    2.91 Gbp
Size of the genome (excluding gaps)                                    2.66 Gbp
Longest contig                                                         1.99 Mbp
Longest scaffold                                                       14.4 Mbp
Percent of A+T in the genome                                           54
Percent of G+C in the genome                                           38
Percent of undetermined bases in the genome                            9
Most GC-rich 50 kb                                                     Chr. 2 (66%)
Least GC-rich 50 kb                                                    Chr. X (25%)
Percent of genome classified as repeats                                35
Number of annotated genes                                              26,383
Percent of annotated genes with unknown function                       42
Number of genes (hypothetical and annotated)                           39,114
Percent of hypothetical and annotated genes with unknown function      59
Gene with the most exons                                               Titin (234 exons)
Average gene size                                                      27 kbp
Most gene-rich chromosome                                              Chr. 19 (23 genes/Mb)
Least gene-rich chromosomes                                            Chr. 13 (5 genes/Mb), Chr. Y (5 genes/Mb)
Total size of gene deserts (>500 kb with no annotated genes)           605 Mbp
Percent of base pairs spanned by genes                                 25.5 to 37.8*
Percent of base pairs spanned by exons                                 1.1 to 1.4*
Percent of base pairs spanned by introns                               24.4 to 36.4*
Percent of base pairs in intergenic DNA                                74.5 to 63.6*
Chromosome with highest proportion of DNA in annotated exons           Chr. 19 (9.33)
Chromosome with lowest proportion of DNA in annotated exons            Chr. Y (0.36)
Longest intergenic region (between annotated + hypothetical genes)     Chr. 13 (3,038,416 bp)
Rate of SNP variation                                                  1/1250 bp

*In these ranges, the percentages correspond to the annotated gene set (26,383 genes) and the hypothetical +
annotated gene set (39,114 genes), respectively.

Table 12. Rate of recombination per physical distance (cM/Mb) across the genome. Genethon markers
were placed on CSA-mapped assemblies, and then relative physical distances and rates were calculated
in 3-Mb windows for each chromosome. NA, not applicable.

              Male                   Sex-average            Female
Chrom.   Max.   Avg.   Min.     Max.   Avg.   Min.     Max.   Avg.   Min.
1        2.60   1.12   0.23     2.81   1.42   0.52     3.39   1.76   0.68
2        2.23   0.78   0.33     2.65   1.12   0.54     3.17   1.40   0.61
3        2.55   0.86   0.23     2.40   1.07   0.42     2.71   1.30   0.33
4        1.66   0.67   0.15     2.06   1.04   0.60     2.50   1.40   0.77
5        2.00   0.67   0.18     1.87   1.08   0.42     2.26   1.43   0.62
6        1.97   0.71   0.28     2.57   1.12   0.37     3.47   1.67   0.64
7        2.34   1.16   0.48     1.67   1.17   0.47     2.27   1.21   0.34
8        1.83   0.73   0.14     2.40   1.05   0.46     3.44   1.36   0.43
9        2.01   0.99   0.53     1.95   1.32   0.77     2.63   1.66   0.82
10       3.73   1.03   0.22     3.05   1.29   0.66     2.84   1.51   0.76
11       1.43   0.72   0.31     2.13   0.99   0.47     3.10   1.32   0.49
12       4.12   0.76   0.26     3.35   1.16   0.49     2.93   1.55   0.59
13       1.60   0.75   0.01     1.87   0.95   0.17     2.49   1.19   0.32
14       3.15   0.98   0.18     2.65   1.30
that account for gene inactivation. The general
structural characteristics of these processed
pseudogenes include the complete
lack of intervening sequences found in the
functional counterparts, a poly(A) tract at the
3' end, and direct repeats flanking the pseudogene
sequence. Processed pseudogenes occur
as a result of retrotransposition, whereas
unprocessed pseudogenes arise from segmental
genome duplication.

We searched the complete set of Otto-predicted
transcripts against the genomic sequence
by means of BLAST. Genomic regions
corresponding to all Otto-predicted
transcripts were excluded from this analysis.
We identified 2909 regions matching with
greater than 70% identity over at least 70% of
the length of the transcripts that likely represent
processed pseudogenes. This number is
probably an underestimate because specific
methods to search for pseudogenes were not
used.

We looked for correlations between
structural elements and the propensity for
retrotransposition in the human genome.
GC content and transcript length were compared
between the genes with processed
pseudogenes (1177 source genes) versus
the remainder of the predicted gene set.
Transcripts that give rise to processed pseudogenes
have shorter average transcript
length (1027 bp versus 1594 bp for the Otto
set) as compared with genes for which no
pseudogene was detected. The overall GC
content did not show any significant difference,
contrary to a recent report (88). There
is a clear trend in gene families that are
present as processed pseudogenes. These
include ribosomal proteins (67%), lamin
receptors (10%), translation elongation factor
alpha (5%), and HMG nonhistone proteins
(2%). The increased occurrence of
retrotransposition (both intronless paralogs 
and processed pseudogenes) among genes 
involved in translation and nuclear regula- 
tion may reflect an increased transcription- 
al activity of these genes. 

5.3 Gene duplication in the human 
genome 

Building on a previously published procedure 
(27), we developed a graph-theoretic algo- 
rithm, called Lek, for grouping the predicted 
human protein set into protein families (89). 



Table 13. Characteristics of CpG islands identified in chromosome 22 (34-Mbp sequence length) and the
whole genome (2.9-Gbp sequence length) by means of two different methods. Method 1 uses a CG
likelihood ratio of ≥0.6. Method 2 uses a CG likelihood ratio of ≥0.8.

Table 14. Distribution of repetitive DNA in the compartmentalized shotgun assembly sequence.
1.6


Long terminal repeat (LTR)
5.6


Long interspersed nucleotide element
1025
The complete clusters that result from the
Lek clustering provide one basis for compar- 
ing the role of whole-genome or chromosom- 
al duplication in protein family expansion as 
opposed to other means, such as tandem du- 
plication. Because each complete cluster rep- 
resents a closed and certain island of homol- 
ogy, and because Lek is capable of simulta- 
neously clustering protein complements of 
several organisms, the number of proteins 
contributed by each organism to a complete 
cluster can be predicted with confidence de- 
pending on the quality of the annotation of 
each genome. The variance of each organ- 
ism's contribution to each cluster can then be 
calculated, allowing an assessment of the rel- 
ative importance of large-scale duplication 
versus smaller-scale, organism-specific ex- 
pansion and contraction of protein families, 
presumably as a result of natural selection 
operating on individual protein families within
an organism. As can be seen in Fig. 12, the
large variance in the relative numbers of human
as compared with D. melanogaster and
Caenorhabditis elegans proteins in complete 
clusters may be explained by multiple events 
of relative expansions in gene families in 
each of the three animal genomes. Such ex- 
pansions would give rise to the distribution 
that shows a peak at 1:1 in the ratio for 
human-worm or human-fly clusters with the 
slope spread covering both human and fly/ 
worm predominance, as we observed (Fig. 
12). Furthermore, there are nearly as many 
clusters where worm and fly proteins predominate
despite the larger numbers of proteins
in the human. At face value, this analysis
suggests that natural selection acting on
individual protein families has been a major 
force driving the expansion of at least some 
elements of the human protein set. However, 
in our analysis, the difference between an 
ancient whole-genome duplication followed 
by loss, versus piecemeal duplication, cannot 
be easily distinguished. In order to differen- 
tiate these scenarios, more extended analyses 
were performed. 

5.4 Large-scale duplications 

Using two independent methods, we 
searched for large-scale duplications in the 
human genome. First, we describe a protein 
family-based method that identified highly
conserved blocks of duplication. We then
describe our comprehensive method for identifying
all interchromosomal block duplications.
The latter method identified a large number of 
duplicated chromosomal segments covering 
parts of all 24 chromosomes. 

16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org

The first of the methods is based on the idea of searching for
blocks of highly conserved homologous proteins that occur in
more than one location on the genome. For this comparison, two
genes were considered equivalent if their protein products were
determined to be in the same family and the same complete Lek
cluster (essentially paralogous genes) (89). Initially, each
chromosome was represented as a string of genes ordered by the
start codons for predicted genes along the chromosome. We
considered the two strands as a single string, because local
inversions are relatively common events relative to large-scale
duplications. Each gene was indexed according to the protein
family and Lek complete cluster (89). All pairs of indexed gene
strings were then aligned in both the forward and reverse
directions with the Smith-Waterman algorithm (90). A match
between two proteins of the same Lek complete cluster was given
a score of 10 and a mismatch -10, with gap open and extend
penalties of -4 and -1. With these parameters, 19 conserved
interchromosomal blocks of duplication were observed, all of
which were also detected and expanded by the comprehensive
method described below. The detection of only a relatively small
number of block duplications was a consequence of using an
intrinsically conservative method grounded in the conservative
constraints of the complete Lek clusters.
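
The scoring scheme described above can be illustrated with a small
local alignment over strings of cluster identifiers. This is a
minimal sketch under stated assumptions, not the authors'
implementation: the function names and the integer cluster labels
are hypothetical, and the gap penalties are read here as -4 for the
first gapped position and -1 for each additional one.

```python
def local_align_score(a, b, match=10, mismatch=-10,
                      gap_open=-4, gap_extend=-1):
    """Best Smith-Waterman local score with affine gaps (Gotoh recurrences).

    a and b are sequences of cluster identifiers (one per gene).
    Assumption: the first gapped position costs gap_open and every
    further position in the same gap costs gap_extend.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # H: best local score ending at (i, j); E, F: best score ending in a gap.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


def best_orientation_score(a, b):
    """Align b in both forward and reverse order, as the text describes."""
    return max(local_align_score(a, b), local_align_score(a, b[::-1]))


if __name__ == "__main__":
    # Two toy "chromosomes"; each integer stands for a Lek complete cluster.
    print(local_align_score([1, 2, 3, 4], [1, 2, 7, 3, 4]))
    print(best_orientation_score([1, 2, 3], [3, 2, 1]))
```

Because a mismatch costs as much as a match earns, short runs of
matching genes are extended only across small gaps, which is
consistent with the conservative behavior noted in the text.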

In the second, more comprehensive approach, we aligned all
chromosomes directly with one another using an algorithm based on
the MUMmer system (91). This alignment method uses a suffix tree
data structure and a linear-time algorithm to align long sequences
very rapidly; for example, two chromosomes of 100 Mbp can be
aligned in less than 20 min (on a Compaq Alpha computer) with 4
gigabytes of memory. This procedure was used recently to identify
numerous large-scale segmental duplications among the five
chromosomes of A. thaliana (92); in that organism, the method
revealed that 60% of the genome (66 Mbp) is covered by 24 very
large duplicated segments. For Arabidopsis, a DNA-based alignment
was sufficient to reveal the segmental duplications between
chromosomes; in the human genome, DNA alignments at the
whole-chromosome level are insufficiently sensitive. Therefore, a
modified procedure was developed and applied, as follows. First,
all 26,588 proteins (9,675,713 amino acids) were concatenated
end-to-end in order as they occur along each of the 24
chromosomes, irrespective of strand location. The concatenated
protein set was then aligned against each chromosome by the
MUMmer algorithm. The resulting matches were clustered to extract
all sets of three or more protein matches that occur in close
proximity on two different chromosomes (93); these represent the
candidate segmental duplications. A series of filters were
developed and applied to remove likely false-positives from this
set; for example, small blocks that were spread across many
proteins were removed. To refine the filtering methods, a shuffled
protein set was first created by taking the 26,588 proteins,
randomizing their order, and then partitioning them into 24
shuffled chromosomes, each containing the same number of proteins
as the true genome. This shuffled protein set has the identical
composition to the real genome; in particular, every protein and
every domain appears the same number of times. The complete
algorithm was then applied to both the real and the shuffled data,
with the results on the shuffled data being used to estimate the
false-positive rate. The algorithm after filtering yielded 10,310
gene pairs in 1077 duplicated blocks containing 3522 distinct
genes; tandemly duplicated expansions in many of the blocks
explain the excess of gene pairs to distinct genes. In the
shuffled data, by contrast, only 370 gene pairs were found, giving
a false-positive estimate of 3.6%. The most likely explanation for
the 1077 block duplications is ancient segmental duplications. In
many cases, the order of the proteins has been shuffled, although
proximity is preserved. Out of the 1077 blocks, 159 contain only
three genes, 137 contain four genes, and 781 contain five or more
genes.
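
The shuffled control and the resulting false-positive estimate can
be sketched as follows. This is an illustration only: the function
names, the seed, and the toy protein labels are assumptions, and the
duplication-detection step itself (MUMmer alignment, clustering, and
filtering) is not reproduced here.

```python
import random


def shuffle_proteins(chromosomes, seed=0):
    """Randomize protein order, then re-partition into chromosomes.

    chromosomes is a list of lists of protein identifiers. The result
    has the same number of proteins per chromosome and exactly the
    same overall composition as the input.
    """
    sizes = [len(chrom) for chrom in chromosomes]
    pool = [protein for chrom in chromosomes for protein in chrom]
    random.Random(seed).shuffle(pool)
    shuffled, start = [], 0
    for size in sizes:
        shuffled.append(pool[start:start + size])
        start += size
    return shuffled


def false_positive_rate(pairs_in_shuffled, pairs_in_real):
    """Fraction of reported gene pairs expected to arise by chance."""
    return pairs_in_shuffled / pairs_in_real


if __name__ == "__main__":
    toy = [["p1", "p2", "p3"], ["p4"], ["p5", "p6"]]
    print(shuffle_proteins(toy, seed=1))
    # Gene-pair counts reported in the text: 370 shuffled, 10,310 real.
    print(f"{false_positive_rate(370, 10310):.1%}")
```

With the counts reported in the text, 370 / 10,310 gives roughly
3.6%, and the block sizes 159 + 137 + 781 sum to the 1077 blocks.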

To illustrate the extent of the detected 
duplications, Fig. 13 shows all 1077 block 
duplications indexed to each chromosome in 
24 panels in which only duplications mapped 
to the indexed chromosome are displayed. 
The figure makes it clear that the duplications 
are ubiquitous in the genome. One feature 
that it displays is many relatively small chro- 
mosomal stretches, with one-to-many dupli- 
cation relationships that are graphically strik- 
ing. One such example captured by the anal- 
ysis is the well-documented olfactory recep- 
tor (OR) family, which is scattered in blocks 
throughout the genome and which has been 
analyzed for genome-deployment reconstructions
at several evolutionary stages (94). The
figure also illustrates that some chromo- 
somes, such as chromosome 2, contain many 
more detected large-scale duplications than 
others. Indeed, one of the largest duplicated 
segments is a large block of 33 proteins on 
chromosome 2, spread among eight smaller 
blocks in 2p, that aligns to a paralogous set on 
chromosome 14, with one rearrangement (see 
chromosomes 2 and 14 panels in Fig. 13). 
The proteins are not contiguous but span a 
region containing 97 proteins on chromosome 2
and 332 proteins on chromosome 14.
The likelihood of observing this many dupli- 
cated proteins by chance, even over a span of 
this length, is 2.3 X 10"** (Pi). This dupli- 
cated set spans 20 Mbp on chromosome 2 and 
63 Mbp on chromosome 14, over 70% of the 
latter chromosome. Chromosome 2 also con- 
tains a block duplication that is nearly as 
large, which is shared by chromosome arm 2q 
and chromosome 12. This duplication incor- 
porates two of the four known Hox gene 
clusters, but considerably expands the extent 
of the duplications proximally and distally on 
the pair of chromosome arms. This breadth of 
duplication is also seen on the two chromo- 
somes carrying the other two Hox clusters. 

An additional large duplication, between 
chromosomes 18 and 20, serves as a good 
example to illustrate some of the features 
common to many of the other observed large
duplications (Fig. 13, inset). This duplication
contains 64 detected ordered intrachromosomal
pairs of homologous genes. After discounting
a 40-Mb stretch of chromosome 18
free of matches to chromosome 20, which is 
likely to represent a large insert (between the 
gene assignments "Krup rel" and "collagen 
rel" on chromosome 18 in Fig. 13), the full 
duplication segment covers 36 Mb on chro- 
mosome 18 and 28 Mb on chromosome 20.

Fig. 12. Gene duplication in complete protein clusters. The predicted protein sets of human, worm,
and fly were subjected to Lek clustering (27). The numbers of clusters with varying ratios (whole
number) of human versus worm and human versus fly proteins per cluster were plotted.

By this measure, the duplication segment
spans nearly half of each chromosome's net 
length. The most likely scenario is that the 
whole span of this region was duplicated as a 
single very large block, followed by shuffling 
owing to smaller scale rearrangements. As 
such, at least four subsequent rearrangements 
would need to be invoked to explain the 
relative insertions and inversions seen in the 
duplicated segment interval. The 64 protein 
pairs in this alignment occur among 217 pro- 
tein assignments on chromosome 18, and 
among 322 protein assignments on chromo- 
some 20, for a density of involved proteins of 
20 to 30%. This is consistent with an ancient 
large-scale duplication followed by subse- 
quent gene loss on one or both chromosomes. 
Loss of just one member of a gene pair 
subsequent to the duplication would result in 
a failure to score a gene pair in the block; less 
than 50% gene loss on the chromosomes 
would lead to the duplication density ob- 
served here. As an independent verification 
of the significance of the alignments detect- 
ed, it can be seen that a substantial number of 
the pairs of aligning proteins in this duplication,
including some of those annotated (Fig.
13), are those populating small Lek complete 
clusters (see above). This indicates that they 
are members of very small families of para- 
logs; their relative scarcity within the genome 
validates the uniqueness and robust nature of 
their alignments. 
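
The "20 to 30%" density quoted above follows directly from the
counts given in the text. The short check below is only a worked
arithmetic example; the function name is hypothetical.

```python
def pair_density(gene_pairs, proteins_in_span):
    """Fraction of the proteins in a span that belong to a detected pair."""
    return gene_pairs / proteins_in_span


# Counts from the text: 64 pairs among 217 (chr 18) and 322 (chr 20) proteins.
density_chr18 = pair_density(64, 217)
density_chr20 = pair_density(64, 322)

if __name__ == "__main__":
    print(f"chromosome 18: {density_chr18:.1%}")
    print(f"chromosome 20: {density_chr20:.1%}")
```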

Two additional qualitative features were observed
among many of the large-scale duplications.
First, several proteins with disease associations,
with OMIM (Online Mendelian Inheritance
in Man) assignments, are members of
duplicated segments (see web table 2 on Science
Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1).
We have also
observed a few instances where paralogs on 
both duplicated segments are associated with 
similar disease conditions. Notable among 
these genes are proteins involved in hemostasis 
(coagulation factors) that are associated with 
bleeding disorders, transcriptional regulators 
like the homeobox proteins associated with de- 
velopmental disorders, and potassium channels 
associated with cardiovascular conduction ab- 
normalities. For each of these disease genes, 
closer study of the paralogous genes in the 
duplicated segment may reveal new insights 
into disease causation, with further investiga- 
tion needed to determine whether they might be 
involved in the same or similar genetic diseases. 
Second, although there is a conserved number 
of proteins and coding exons predicted for spe- 
cific large duplicated spans within the chromo- 
some 18 to 20 alignment, the genomic DNA of 
chromosome 18 in these specific spans is in 
some cases more than 10-fold longer than the 
corresponding chromosome 20 DNA. This se- 
lective accretion of noncoding DNA (or con- 
versely, loss of noncoding DNA) on one of a
pair of duplicated chromosome regions was
observed in many compared regions. Hypothe- 
ses to explain which mechanisms foster these 
processes must be tested. 

Evaluation of the alignment results gives 
some perspective on dating of the duplications. 
As noted above, large-scale ancient segmental 
duplication in fact best explains many of the 
blocks detected by this genome-wide analysis. 
The regions of human chromosomes involved 
in the large-scale duplications expanded upon 
above (chromosomes 2 to 14, 2 to 12, and 18 to 
20) are each syntenic to a distinct mouse chromosomal
region. The corresponding mouse
chromosomal regions are much more similar in
sequence conservation, and even in order, to 
their human synteny partners than the human 
duplication regions are to each other. Further, 
the corresponding mouse chromosomal regions 
each bear a significant proportion of genes orthologous
to the human genes on which the
human duplication assignments were made. On
the basis of these factors, the corresponding 
mouse chromosomal spans, at coarse resolution,
appear to be products of the same large-scale
duplications observed in humans. Although
further detailed analysis must be carried
out once a more complete genome is assembled 
for mouse, the underlying large duplications 
appear to predate the two species' divergence. 
This dates the duplications, at the latest, before 
divergence of the primate and rodent lineages. 
This date can be further refined upon examination
of the synteny between human chromosomes
and those of chicken, pufferfish (Fugu
rubripes), or zebrafish (95). The only substantial
syntenic stretches mapped in these
species corresponding to both pairs of human 
duplications are restricted to the Hox cluster 
regions. When the synteny of these regions 
(or others) to human chromosomes is extend- 
ed with further mapping, the ages of the 
nearly chromosome-length duplications seen 
in humans are likely to be dated to the root of 
vertebrate divergence. 

The MUMmer-based results demonstrate 
large block duplications that range in size from 
a few genes to segments covering most of a 
chromosome. The extent of segmental duplica- 
tions raises the question of whether an ancient 
whole-genome duplication event is the under- 
lying explanation for the numerous duplicated 
regions (96). The duplications have undergone
many deletions and subsequent rearrangements; 
these events make it difficult to distinguish 
between a whole-genome duplication and mul- 
tiple smaller events. Further analysis, focused 
especially on comparing the estimated ages of 
all the block duplications, derived partially 
from interspecies genome comparisons, will be
necessary to determine which of these two hypotheses
is more likely. Comparisons of genomes
of different vertebrates, and even cross-phyla
genome comparisons, will allow for the
deconvolution of duplications to eventually reveal
the stagewise history of our genome, and
with it a history of the emergence of many of
the key functions that distinguish us from other
living things.

6 A Genome-Wide Examination of 
Sequence Variations 

Summary. Computational methods were used to identify
single-nucleotide polymorphisms (SNPs) by comparison of the
Celera sequence to other SNP resources. The SNP rate between two
chromosomes was ~1 per 1200 to 1500 bp. SNPs are distributed
nonrandomly throughout the genome. Only a very small proportion
of all SNPs (<1%) potentially impact protein function based on
the functional analysis of SNPs that affect the predicted coding
regions. This results in an estimate that only thousands, not
millions, of genetic variations may contribute to the structural
diversity of human proteins.

Having a complete genome sequence enables researchers to achieve
a dramatic acceleration in the rate of gene discovery, but only
through analysis of sequence variation in DNA can we discover the
genetic basis for variation in health among human beings.
Whole-genome shotgun sequencing is a particularly effective
method for detecting sequence variation in tandem with
whole-genome assembly. In addition, we compared the distribution
and attributes of SNPs ascertained by three other methods: (i)
alignment of the Celera consensus sequence to the PFP assembly,
(ii) overlap of high-quality reads of genomic sequence (referred
to as "Kwok"; 1,120,195 SNPs) (97), and (iii) reduced
representation shotgun sequencing (referred to as "TSC"; 632,640
SNPs) (98). These data were consistent in showing an overall
nucleotide diversity of ~8 × 10^-4, marked heterogeneity across
the genome in SNP density, and an overwhelming preponderance of
noncoding variation that produces no change in expressed
proteins.

6.1 SNPs found by aligning the Celera 
consensus to the PFP assembly 

Ideally, methods of SNP discovery make full use of sequence depth
and quality at every site, and quantitatively control the rate of
false-positive and false-negative calls with an explicit sampling
model (99). Comparison of consensus sequences in the absence of
these details necessitated a more ad hoc approach (quality scores
could not readily be obtained for the PFP assembly). First, all
sequence differences between the two consensus sequences were
identified; these were then filtered to reduce the contribution
of sequencing errors and misassembly. As a measure of the
effectiveness of the filtering step, we monitored the ratio of
transition and transversion substitutions, because a 2:1 ratio
has been well documented as typical in mammalian evolution (100)
and in human SNPs (101, 102). The filtering steps consisted of
removing variants where the quality score in the Celera consensus
was less than 30 and where the density of variants was greater
than 5 in 400 bp. These filters resulted in shifting the
transition-to-transversion ratio from 1.57:1 to 1.89:1. When
applied to 2.3 Gbp of alignments between the Celera and PFP
consensus sequences, these filters resulted in identification of
2,104,820 putative SNPs from a total of 2,778,474 substitution
differences. Overlaps between this set of SNPs and those found by
other methods are described below.
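
The two filters and the transition-to-transversion monitor can be
sketched as below. This is a hedged illustration rather than the
pipeline actually used: the tuple layout (position, quality, allele,
allele), the function names, and the reading of "greater than 5 in
400 bp" as "six or more candidate variants falling inside any 400-bp
window" are all assumptions.

```python
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def is_transition(allele_a, allele_b):
    """A<->G and C<->T are transitions; every other change is a transversion."""
    return frozenset((allele_a, allele_b)) in TRANSITIONS


def ts_tv_ratio(variants):
    """Transition:transversion ratio for (pos, qual, allele, allele) tuples."""
    ts = sum(1 for v in variants if is_transition(v[2], v[3]))
    tv = len(variants) - ts
    return ts / tv if tv else float("inf")


def filter_variants(variants, min_quality=30, max_in_window=5, window=400):
    """Drop low-quality calls and calls lying in variant-dense windows.

    Density is assessed on all candidates before the quality filter
    (an assumption; the text does not give the order).
    """
    variants = sorted(variants)
    positions = [v[0] for v in variants]
    dense = [False] * len(variants)
    # If max_in_window + 1 consecutive variants span < window bp, flag them all.
    for k in range(len(positions) - max_in_window):
        if positions[k + max_in_window] - positions[k] < window:
            for i in range(k, k + max_in_window + 1):
                dense[i] = True
    return [v for v, d in zip(variants, dense)
            if not d and v[1] >= min_quality]


if __name__ == "__main__":
    calls = [(100 + 10 * i, 40, "A", "G") for i in range(6)]  # dense cluster
    calls += [(1000, 20, "C", "T"), (2000, 35, "A", "G"), (3000, 50, "A", "C")]
    kept = filter_variants(calls)
    print(kept, ts_tv_ratio(kept))
```

Tracking the ratio before and after filtering mirrors the text's use
of the 2:1 expectation as an indicator of how many sequencing errors
remain.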

6.2 Comparisons to public SNP
databases 

Additional SNPs, including 2,536,021 from dbSNP
(www.ncbi.nlm.nih.gov/SNP) and 13,150 from HGMD (Human Gene
Mutation Database, from the University of Wales, UK), were mapped
on the Celera consensus sequence by a sequence similarity search
with the program PowerBlast (103). The two largest data sets in
dbSNP are the Kwok and TSC sets, with 47% and 25% of the dbSNP
records. Low-quality alignments with partial coverage of the
dbSNP sequence and alignments that had less than 98% sequence
identity between the Celera sequence and the dbSNP flanking
sequence were eliminated. dbSNP sequences mapping to multiple
locations on the Celera genome were discarded. A total of
2,336,935 dbSNP variants were mapped to 1,223,038 unique
locations on the Celera sequence, implying considerable
redundancy in dbSNP. SNPs in the TSC set mapped to 585,811 unique
genomic locations, and SNPs in the Kwok set mapped to 438,032
unique locations. The combined unique SNP count used in this
analysis, including Celera-PFP, TSC, and Kwok, is 2,737,668.
Table 15 shows that a substantial fraction of SNPs identified by
one of these methods was also found by another method. The very
high overlap (36.2%) between the Kwok and Celera-PFP SNPs may be
due in part to the use by Kwok of sequences that went into the
PFP assembly. The unusually low overlap (16.4%) between the Kwok
and TSC sets is due to their being the smallest two sets. In
addition, 24.5% of the Celera-PFP SNPs overlap with SNPs derived
from the Celera genome sequences (46). SNP validation in
population samples is an expensive and laborious process, so
confirmation on multiple data sets may provide an efficient
initial validation "in silico" (by computational analysis).

Table 15. Overlap of SNPs from genome-wide SNP databases. Table
entries are SNP counts for each pair of data sets. Numbers in
parentheses are the fraction of overlap, calculated as the count
of overlapping SNPs divided by the number of SNPs in the smaller
of the two databases compared. Total SNP counts for the databases
are: Celera-PFP, 2,104,820; TSC, 585,811; and Kwok, 438,032. Only
unique SNPs in the TSC and Kwok data sets were included.
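
The overlap fraction defined in the Table 15 caption can be
recomputed from the totals in the caption and the pairwise counts in
the table. The snippet is a worked check only; the dictionary and
function names are hypothetical.

```python
# Unique SNP totals per data set, as given in the Table 15 caption.
TOTALS = {"Celera-PFP": 2_104_820, "TSC": 585_811, "Kwok": 438_032}

# Pairwise overlap counts, as given in the Table 15 entries.
OVERLAPS = {
    ("Celera-PFP", "TSC"): 188_694,
    ("Celera-PFP", "Kwok"): 158_532,
    ("TSC", "Kwok"): 72_024,
}


def overlap_fraction(shared, total_a, total_b):
    """Shared SNPs divided by the size of the smaller data set."""
    return shared / min(total_a, total_b)


FRACTIONS = {
    pair: round(overlap_fraction(n, TOTALS[pair[0]], TOTALS[pair[1]]), 3)
    for pair, n in OVERLAPS.items()
}

if __name__ == "__main__":
    for pair, fraction in FRACTIONS.items():
        print(pair, fraction)
```

Dividing by the smaller set bounds each fraction at 1 and keeps the
much larger Celera-PFP set from diluting the comparison.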

One means of assessing whether the three sets of SNPs provide the
same picture of human variation is to tally the frequencies of
the six possible base changes in each set of SNPs (Table 16).
Previous measures of nucleotide diversity were mostly derived
from small-scale analysis on candidate genes (101), and our
analysis with all three data sets validates the previous
observations at the whole-genome scale. There is remarkable
homogeneity between the SNPs found in the Kwok set, the TSC set,
and in our whole-genome shotgun (46) in this substitution
pattern. Compared with the rest of the data sets, Celera-PFP
deviates slightly from the 2:1 transition-to-transversion ratio
observed in the other SNP sets. This result is not unexpected,
because some fraction of the computationally identified SNPs in
the Celera-PFP comparison may in fact be sequence errors. A 2:1
transition:transversion ratio for the bona fide SNPs would be
obtained if one assumed that 15% of the sequence differences in
the Celera-PFP set were a result of (presumably random) sequence
errors.
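
The 15% figure can be reproduced with a simple mixture calculation.
The sketch below rests on an assumption not stated in the text: that
random sequence errors produce transitions and transversions in a
1:2 ratio (one of the three possible substitutions at a base is a
transition). The function names are hypothetical.

```python
def transition_share(ts_tv_ratio):
    """Convert a transition:transversion ratio into a transition fraction."""
    return ts_tv_ratio / (1.0 + ts_tv_ratio)


def error_fraction(observed_ratio, true_ratio=2.0, error_ratio=0.5):
    """Fraction of differences that must be errors to explain the shift.

    Solves observed = (1 - e) * true + e * error for e, with every
    ratio first expressed as a transition fraction.
    """
    observed = transition_share(observed_ratio)
    true = transition_share(true_ratio)
    error = transition_share(error_ratio)
    return (true - observed) / (true - error)


if __name__ == "__main__":
    # Table 16 lists 1.59:1 for the Celera-PFP set.
    print(f"{error_fraction(1.59):.1%}")
```

Under that assumption the observed 1.59:1 ratio corresponds to an
error fraction of roughly 16%, in line with the 15% quoted above.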

6.3 Estimation of nucleotide diversity 
from ascertained SNPs 

The number of SNPs identified varied widely across chromosomes.
In order to normalize these values to the chromosome size and
sequence coverage, we used π, the standard statistic for
nucleotide diversity (104). Nucleotide diversity is a measure of
per-site heterozygosity, quantifying the probability that a pair
of chromosomes drawn from the population will differ at a
nucleotide site. In order to calculate nucleotide diversity for
each chromosome, we need to know the number of nucleotide sites
that were surveyed for variation, and in methods like reduced
representation sequencing, we need to know the sequence quality
and the depth of coverage at each site. These data are not
readily available, so we could not estimate nucleotide diversity
from the TSC effort. Estimation of nucleotide diversity from
high-quality sequence overlaps should be possible, but again,
more information is needed on the details of all the alignments.
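
As a concrete reading of the definition above, nucleotide diversity
for a set of aligned sequences is the average number of pairwise
differences per site. The sketch below assumes equal-length,
gap-free sequences and ignores the coverage and quality corrections
the text says are needed in practice; the function names are
hypothetical.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Mean per-site difference over all pairs of aligned sequences."""
    length = len(sequences[0])
    pairs = list(combinations(sequences, 2))
    differences = sum(
        sum(1 for x, y in zip(s, t) if x != y) for s, t in pairs
    )
    return differences / (len(pairs) * length)


def bases_per_difference(pi):
    """Average spacing between differences for two chromosomes."""
    return 1.0 / pi


if __name__ == "__main__":
    print(nucleotide_diversity(["ACGT", "ACGA", "ACGT"]))
    # A diversity of about 8 x 10^-4 is one difference per ~1250 bp.
    print(round(bases_per_difference(8e-4)))
```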

Estimation of nucleotide diversity from a shotgun assembly
entails calculating, for each column of the multialignment, the
probability that two or more distinct alleles are present, and
the probability of detecting a SNP if in fact the alleles have
different sequence (i.e., the probability of correct sequence
calls). The greater the depth of coverage and the higher the
sequence quality, the higher is the chance of successfully
detecting a SNP (105). Even after correcting for variation in
coverage, the nucleotide diversity appeared to vary across
autosomes. The significance of this heterogeneity was tested by
analysis of variance, with estimates of π for 100-kbp windows to
estimate variability within chromosomes (for the Celera-PFP
comparison, F = 29.73, P < 0.0001).

Average diversity for the autosomes estimated from the Celera-PFP
comparison was 8.94 × 10^-4. Nucleotide diversity on the X
chromosome was 6.54 × 10^-4. The X is expected to be less
variable than autosomes, because for every four copies of
autosomes in the population, there are only three X chromosomes,
and this smaller effective population size means that random
drift will more rapidly remove variation from the X (106).
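
The three-to-four argument can be checked against the two diversity
values just quoted. This is a worked comparison only; it takes the
3:4 copy-number ratio from the text as the expected diversity ratio,
which is a simplifying assumption, and the variable names are
hypothetical.

```python
# Values quoted in the text for the Celera-PFP comparison.
AUTOSOME_DIVERSITY = 8.94e-4
X_DIVERSITY = 6.54e-4

# Three X chromosomes for every four autosome copies in the population.
EXPECTED_RATIO = 3 / 4
OBSERVED_RATIO = X_DIVERSITY / AUTOSOME_DIVERSITY

if __name__ == "__main__":
    print(f"expected {EXPECTED_RATIO:.2f}, observed {OBSERVED_RATIO:.2f}")
```

The observed ratio of about 0.73 is close to the 0.75 expected from
copy number alone.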

Having ascertained nucleotide variation genome-wide, it appears
that previous estimates of nucleotide diversity in humans based
on samples of genes were reasonably accurate (101, 102, 106,
107). Genome-wide, our estimate of nucleotide diversity was
8.98 × 10^-4 for the Celera-PFP alignment, and a published
estimate averaged over 10 densely resequenced human genes was
8.00 × 10^-4 (108).



Table 15 (entries). SNP counts for each pair of data sets, with
the fraction of overlap in parentheses.

              TSC                Kwok
Celera-PFP    188,694 (0.322)    158,532 (0.362)
TSC                              72,024 (0.164)

Table 16. Summary of nucleotide changes in different SNP data sets.

SNP data set   A/G (%)   C/T (%)   A/C (%)   A/T (%)   C/G (%)   T/G (%)   Transition:transversion
Celera-PFP     30.7      30.7      10.3      8.6       9.2       10.3      1.59:1
Kwok*          33.7      33.8      8.5       7.0       8.6       8.4       2.07:1
TSC†           33.3      33.4      8.8       7.3       8.6       8.6       1.99:1

*[...] release of the NCBI database dbSNP (www.ncbi.nlm.nih.gov/SNP/) with the method defined as Overlap
SnpDetectionWithPolyBayes. The submitter of the data is Pui-Yan Kwok from Washington University. †[...]
2000 release of NCBI dbSNP (www.ncbi.nlm.nih.gov/SNP/) with the methods defined as [...]
TSC-WUCSC. The submitter of the data is Lincoln Stein from Cold Spring Harbor Laboratory.

Fig. 13. Segmental duplications between chromosomes in the human
genome. The 24 panels show the 1077 duplicated blocks of genes,
containing 10,310 pairs of genes in total. Each line represents a
pair of homologous genes belonging to a block; all blocks contain
at least three genes on each of the chromosomes where they
appear. Each panel shows all the duplications between a single
chromosome and other chromosomes with shared blocks. The
chromosome at the center of each panel is shown as a thick red
line for emphasis. Other chromosomes are displayed from top to
bottom within each panel ordered by chromosome number. The inset
(bottom, center right) shows a close-up of one duplication
between chromosomes 18 and 20, expanded to display the gene names
of 12 of the 64 gene pairs shown.

6.4 Variation in nucleotide diversity
across the human genome

Such an apparently high degree of variability
among chromosomes in SNP density
raises the question of whether there is
heterogeneity at a finer scale within
chromosomes, and whether this heterogeneity is
greater than expected by chance. If SNPs 
occur by random and independent mutations, 
then it would seem that there ought to be a 
Poisson distribution of numbers of SNPs in 
fragments of arbitrary constant size. The ob- 
served dispersion in the distribution of SNPs 
in 100-kbp fragments was far greater than 
predicted from a Poisson distribution (Fig. 
14). However, this simplistic model ignores 
the different recombination rates and popula- 
tion histories that exist in different regions of 
the genome. Population genetics theory holds 
that we can account for this variation with a 
mathematical formulation called the neutral
coalescent (109). Applying well-tested algorithms
for simulating the neutral coalescent
with recombination (110), and using an effective
population size of 10,000 and a per-base
recombination rate equal to the mutation
rate (111), we generated a distribution of numbers
of SNPs by this model as well (112). The
observed distribution of SNPs has a much larger
variance than either the Poisson model or the
coalescent model, and the difference is highly
significant. This implies that there is significant
variability across the genome in SNP density, 
an observation that begs an explanation. 
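
One simple way to see the departure from a Poisson model is the
index of dispersion (variance divided by mean) of SNP counts per
window, which is close to 1 for Poisson-distributed counts. The
sketch below is an illustration only: it is not the test used in the
analysis, it does not simulate the coalescent, and the toy counts
and function name are invented for the example.

```python
from statistics import mean, variance


def dispersion_index(counts_per_window):
    """Sample variance over mean; about 1 under a Poisson model."""
    return variance(counts_per_window) / mean(counts_per_window)


if __name__ == "__main__":
    evenly_spread = [3, 3, 3, 3]     # less variable than Poisson
    clustered = [0, 10, 0, 10]       # far more variable than Poisson
    print(dispersion_index(evenly_spread))
    print(dispersion_index(clustered))
```

Values well above 1, as in the clustered toy example, correspond to
the overdispersion the text reports for 100-kbp fragments.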

Several attributes of the DNA sequence 
may affect the local density of SNPs, in- 
cluding the rate at which DNA polymerase 
makes errors and the efficacy of mismatch 
repair. One key factor that is likely to be 
associated with SNP density is the G+C 
content, in part because methylated cytosines
in CpG dinucleotides tend to undergo
deamination to form thymine, accounting
for a nearly 10-fold increase in the
mutation rate of CpGs over other dinucleotides.
We tallied the GC content and nucleotide
diversities in 100-kbp windows
across the entire genome and found that the
correlation between them was positive (r =
0.21) and highly significant (P < 0.0001),
but G+C content accounted for only a
small part of the variation.
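
The statement that G+C content explains only a small part of the
variation follows from squaring the correlation coefficient. The
sketch below shows the calculation; the function names are
hypothetical and the per-window values in the demonstration are
invented, not data from the analysis.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def variance_explained(r):
    """Share of variance in one variable accounted for by the other."""
    return r * r


if __name__ == "__main__":
    gc_content = [0.38, 0.41, 0.45, 0.52]      # invented window values
    diversity = [7.9e-4, 9.5e-4, 8.1e-4, 9.0e-4]
    print(pearson_r(gc_content, diversity))
    # With the reported r = 0.21, about 4% of the variance is explained.
    print(f"{variance_explained(0.21):.1%}")
```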

6.5 SNPs by genomic class 

To test homogeneity of SNP densities across functional classes,
we partitioned sites into intergenic (defined as >5 kbp from any
predicted transcription unit), 5'-UTR, exonic (missense and
silent), intronic, and 3'-UTR for 10,239 known genes, derived
from the NCBI RefSeq database and all human genes predicted from
the Celera Otto annotation. In coding regions, SNPs were
categorized as either silent, for those that do not change amino
acid sequence, or missense, for those that change the protein
product. The ratio of missense to silent coding SNPs in
Celera-PFP, TSC, and Kwok sets (1.12, 0.91, and 0.78,
respectively) shows a markedly reduced frequency of missense
variants compared with the neutral expectation, consistent with
the elimination by natural selection of a fraction of the
deleterious amino acid changes (112). These ratios are comparable
to the missense-to-silent ratios of 0.88 and 1.17 found by
Cargill et al. (101) and by Halushka et al. (102). Similar
results were observed in SNPs derived from Celera shotgun
sequences (46).

It is striking how small is the fraction of SNPs that lead to
potentially dysfunctional alterations in proteins. In the 10,239
RefSeq genes, missense SNPs were only about 0.12, 0.14, and 0.17%
of the total SNP counts in Celera-PFP, TSC, and Kwok SNPs,
respectively. Nonconservative protein changes constitute an even
smaller fraction of missense SNPs (47, 41, and 40% in Celera-PFP,
Kwok, and TSC). Intergenic regions have been virtually unstudied
(113), and we note that 75% of the SNPs we identified were
intergenic (Table 17). The SNP rate was highest in introns and
lowest in exons. The SNP rate was lower in intergenic regions
than in introns, providing one of the first discriminators
between these two classes of DNA. These SNP rates were confirmed
in the Celera SNPs, which also exhibited a lower rate in exons
than in introns, and in extragenic regions than in introns (46).
Many of these intergenic SNPs will provide valuable information
in the form of markers for linkage and association studies, and
some fraction is likely to have a regulatory function as well.

Fig. 14. SNP density in each 100-kbp interval as determined with Celera-PFP SNPs. The color codes
are as follows: black, Celera-PFP SNP density; blue, coalescent model; and red, Poisson distribution.
The figure shows that the distribution of SNPs along the genome is nonrandom and is not entirely
accounted for by a coalescent model of regional history. [Axis label: Number of SNPs / 100 kb.]

7 An Overview of the Predicted 
Protein-Coding Genes in the Human 
Genome 

Summary. This section provides an initial computational analysis
of the predicted protein set with the aim of cataloging prominent
differences and similarities when the human genome is compared
with other fully sequenced eukaryotic genomes. Over 40% of the
predicted protein set in humans cannot be ascribed a molecular
function by methods that assign proteins to known families. A
protein domain-based analysis provides a detailed catalog of the
prominent differences in the human genome when compared with the
fly and worm genomes. Prominent among these are domain expansions
in proteins involved in developmental regulation and in cellular
processes such as neuronal function, hemostasis, acquired immune
response, and cytoskeletal complexity. The final enumeration of
protein families and details of protein structure will rely on
additional experimental work and comprehensive manual curation.

A preliminary analysis of the predicted human
protein-coding genes was conducted.
Two methods were used to analyze and classify
the molecular functions of 26,588 predicted
proteins that represent 26,383 gene
predictions with at least two lines of evidence
as described above. The first method was
based on an analysis at the level of protein 
families, with both the publicly available 
Pfam database (114, 115) and Celera's Pan- 
ther Classification (CPC) (Fig. 15) (116). 
The second method was based on an analysis 
at the level of protein domains, with both the 
Pfam and SMART databases (115, 117). 

The results presented here are preliminary and are subject to
several limitations. Both the gene predictions and functional
assignments have been made by using computational tools, although
the statistical models in Panther, Pfam, and SMART have been
built, annotated, and reviewed by expert biologists. In the set
of computationally predicted genes, we expect both false-positive
predictions (some of these may in fact be inactive pseudogenes)
and false-negative predictions (some human genes will not be
computationally predicted). We also expect errors in
delimiting the boundaries of exons and genes.
Similarly, in the automatic functional assign- 
ments, we also expect both false-positive and 
false-negative predictions. The functional as- 
signment protocol focuses on protein families 
that tend to be found across several organisms, 
or on families of known human genes. There- 
fore, we do not assign a function to many genes 
that are not in large families, even if the func- 
tion is known. Unless otherwise specified, all 
enumeration of the genes in any given family or 
functional category was taken from the set of 
26,588 predicted proteins, which were assigned 
functions by using statistical score cutoffs de- 
fined for models in Panther, Pfam, and 
SMART. 

For this initial examination of the pre- 
dicted human protein set, three broad ques- 
tions were asked: (i) What are the likely 
molecular functions of the predicted gene 
products, and how are these proteins categorized
with current classification methods?
(ii) What are the core functions that
appear to be common across the animals?
(iii) How does the human protein complement
differ from that of other sequenced
eukaryotes? 

7.1 Molecular functions of predicted 
human proteins 

Figure 15 shows an overview of the puta- 
tive molecular functions of the predicted 
26,588 human proteins that have at least
two lines of supporting evidence. About
41% (12,809) of the gene products could
not be classified from this initial analysis
and are termed proteins with unknown
functions. Because our automatic classifi- 
cation methods treat only relatively large 
protein families, there are a number of 
"unclassified" sequences that do, in fact, 
have a known or predicted function. For the 
60% of the protein set that have automatic 
functional predictions, the specific protein 
functions have been placed into broad 
classes. We focus here on molecular func- 
tion (rather than higher order cellular pro- 
cesses) in order to classify as many proteins 
as possible. These functional predictions 
are based on similarity to sequences of 
known function. 

In our analysis of the 12,731 additional low- 
confidence predicted genes (those with only 
one piece of supporting evidence), only 636 
(5%) of these additional putative genes were 
assigned molecular functions by the automated 
methods. One-third of these 636 predicted 
genes represented endogenous retroviral proteins,
further suggesting that the majority of
these unknown-function genes are not real
genes. Given that most of these additional 
12,095 genes appear to be unique among the 
genomes sequenced to date, many may simply 
represent false-positive gene predictions. 
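
The counts quoted in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures stated in the text (12,731 low-confidence predictions, 636 with an assigned function); the variable names are illustrative and not part of the original analysis.

```python
# Counts quoted in the text for the low-confidence gene predictions.
low_confidence_genes = 12_731   # predictions with only one line of evidence
assigned_function = 636         # of those, assigned a molecular function

# Predictions left without any automated functional assignment.
unassigned = low_confidence_genes - assigned_function

# Share of the low-confidence predictions that received an assignment.
assigned_pct = 100 * assigned_function / low_confidence_genes

print(f"assigned a function: {assigned_pct:.1f}%")
print(f"without an assignment: {unassigned:,}")
```

The difference reproduces the 12,095 genes mentioned above, and the percentage rounds to the quoted 5%.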

The most common molecular functions are
the transcription factors and those involved in
nucleic acid metabolism (nucleic acid enzyme).
Other functions that are highly represented in
the human genome are the receptors, kinases,
and hydrolases. Not surprisingly, most of the
hydrolases are proteases. There are also many
proteins that are members of proto-oncogene
families, as well as families of "select regulatory
molecules": (i) proteins involved in specific
steps of signal transduction such as heterotrimeric
GTP-binding proteins (G proteins) and
cell cycle regulators, and (ii) proteins that modulate
the activity of kinases, G proteins, and
phosphatases.

Table 17. Distribution of SNPs in classes of genomic regions.

Genomic region class     Size of region examined (Mb)   Celera-PFP SNP density (SNP/Mb)
Intergenic               2185                           707
Gene (intron + exon)     646                            917
Intron                   615                            921
First intron             164                            808
Exon                     31                             529
First exon               10                             592

Fig. 15. Distribution of the molecular functions of 26,383 human genes.
Each slice lists the numbers and percentages (in parentheses) of human
gene functions assigned to a given category of molecular function. The
outer circle shows the assignment to molecular function categories in
the Gene Ontology (GO) (179), and the inner circle shows the assignment
to Celera's Panther molecular function categories (116).

Slice labels: cell adhesion (577, 1.9%); miscellaneous (1318, 4.3%);
viral protein (100, 0.3%); transfer/carrier protein (203, 0.7%);
transcription factor (1850, 6.0%); nucleic acid enzyme (2308, 7.5%);
signaling molecule (376, 1.2%); receptor (1543, 5.0%); kinase (868, 2.8%);
select regulatory molecule (988, 3.2%); transferase (610, 2.0%);
synthase and synthetase (313, 1.0%); oxidoreductase (656, 2.1%);
lyase (117, 0.4%); ligase (56, 0.2%); isomerase (163, 0.5%);
hydrolase (1227, 4.0%); chaperone (159, 0.5%);
cytoskeletal structural protein (876, 2.8%); extracellular matrix (437, 1.4%);
immunoglobulin (264, 0.9%); ion channel (406, 1.3%); motor (376, 1.2%);
structural protein of muscle (296, 1.0%); proto-oncogene (902, 2.9%);
select calcium binding protein (34, 0.1%); intracellular transporter (350, 1.1%);
transporter (533, 1.7%); molecular function unknown (12809, 41.7%).

7.2 Evolutionary conservation of core
processes

Because of the various "model organism"
genome-sequencing projects that have already
been completed, reasonable comparative
information is available for beginning the
analysis of the evolution of the human genome.
The genomes of S. cerevisiae ("bakers' yeast")
(118) and two diverse invertebrates,
C. elegans (a nematode worm) (119)
and D. melanogaster (fly) (26), as well as the
first plant genome, A. thaliana, recently completed
(92), provide a diverse background for
genome comparisons.

We enumerated the "strict orthologs" conserved between human and fly,
and between human and worm (Fig. 16) to address the question: What are
the core functions that appear to be common across the animals? The
concept of orthology is important because if two genes are orthologs,
they can be traced by descent to the common ancestor of the two
organisms (an "evolutionarily conserved protein set"), and therefore
are likely to perform similar conserved functions in the different
organisms. It is critical in this analysis to separate orthologs (a
gene that appears in two organisms by descent from a common ancestor)
from paralogs (a gene that appears in more than one copy in a given
organism by a duplication event) because paralogs may subsequently
diverge in function. Following the yeast-worm ortholog comparison in
(120), we identified two different cases for each pairwise comparison
(human-fly and human-worm). The first case was a pair of genes, one
from each organism, for which there was no other close homolog in
either organism. These are straightforwardly identified as orthologous,
because there are no additional members of the families that complicate
separating orthologs from paralogs. The second case is a family of
genes with more than one member in either or both of the organisms
being compared. Chervitz et al. (120) dealt with this case by analyzing
a phylogenetic tree that described the relationships between all of the
sequences in both organisms, and then looking for pairs of genes that
were nearest neighbors in the tree. If the nearest-neighbor pairs were
from different organisms, those genes were presumed to be orthologs. We
note that these nearest neighbors can often be confidently identified
from pairwise sequence comparison without having to examine a
phylogenetic tree (see legend to Fig. 16). If the nearest neighbors are
not from different organisms, there has been a paralogous expansion in
one or both organisms after the speciation event (and/or a gene loss by
one organism). When this one-to-one correspondence is lost, defining an
ortholog becomes ambiguous. For our initial computational overview of
the predicted human protein set, we could not answer this question for
every predicted protein. Therefore, we consider only "strict
orthologs," i.e., the proteins with unambiguous one-to-one
relationships (Fig. 16). By these criteria, there are 2758 strict
human-fly orthologs and 2031 human-worm orthologs (1523 in common
between these sets). We define the evolutionarily conserved set as
those 1523 human proteins that have strict orthologs in both
D. melanogaster and C. elegans.
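
The "strict ortholog" rule described above (each gene is the other's best cross-genome match, and that match is more significant than any within-genome paralog) can be sketched in a few lines. This is only an illustration under stated assumptions: the score dictionaries, gene identifiers, and the `min_score` cutoff are hypothetical stand-ins for real BLASTP output and a significance threshold, higher scores are taken to mean more significant alignments, and ties simply keep the first hit seen.

```python
def best_partner(scores):
    """Return {query: (subject, score)} keeping the highest-scoring subject."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return best


def strict_orthologs(cross, within_a, within_b, min_score=0.0):
    """Pairs (a, b) that are reciprocal best hits and outscore every paralog."""
    a_to_b = best_partner(cross)
    b_to_a = best_partner({(b, a): s for (a, b), s in cross.items()})

    def best_paralog(gene, within):
        # Highest within-genome score involving `gene`, or -inf if none.
        return max(
            (s for pair, s in within.items() if gene in pair),
            default=float("-inf"),
        )

    pairs = []
    for a, (b, score) in a_to_b.items():
        reciprocal = b_to_a[b][0] == a
        beats_paralogs = (
            score > best_paralog(a, within_a)
            and score > best_paralog(b, within_b)
        )
        if reciprocal and beats_paralogs and score >= min_score:
            pairs.append((a, b))
    return sorted(pairs)


# Toy scores (hypothetical): higher means a more significant alignment.
cross = {
    ("hs1", "dm1"): 300.0,
    ("hs2", "dm2"): 250.0,
    ("hs3", "dm2"): 240.0,
    ("hs2", "dm1"): 40.0,
}
within_human = {("hs2", "hs3"): 280.0}  # hs2 and hs3 are close paralogs
within_fly = {}

print(strict_orthologs(cross, within_human, within_fly))
```

In the toy data only `hs1`/`dm1` survives: `hs2` and `dm2` are reciprocal best hits, but `hs2` matches its paralog `hs3` more strongly, so the pair is treated as ambiguous and dropped, which is why the measure acts as a lower bound.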

Fig. 16. Functions of putative orthologs across vertebrate and
invertebrate genomes. Each slice lists the number and percentages (in
parentheses) of "strict orthologs" between the human, fly, and worm
genomes involved in a given category of molecular function. "Strict
orthologs" are defined here as bi-directional BLAST best hits (180)
such that each orthologous pair (i) has a BLASTP P-value of ≤10^-10
(120), and (ii) has a more significant BLASTP score than any paralogs
in either organism, i.e., there has likely been no duplication
subsequent to speciation that might make the orthology ambiguous. This
measure is quite strict and is a lower bound on the number of
orthologs. By these criteria, there are 2758 strict human-fly
orthologs, and 2031 human-worm orthologs (1523 in common between these
sets).

The distribution of the functions of the conserved protein set is shown
in Fig. 16. Comparison with Fig. 15 shows that, not surprisingly, the
set of conserved proteins is not distributed among molecular functions
in the same way as the whole human protein set. Compared with the whole
human set (Fig. 15), there are several categories that are
overrepresented in the conserved set by a factor of ~2 or more. The
first category is nucleic acid enzymes, primarily the transcriptional
machinery (notably DNA/RNA methyltransferases, DNA/RNA polymerases,
helicases, DNA ligases, DNA- and RNA-processing factors, nucleases, and
ribosomal proteins). The basic transcriptional and translational
machinery is well known to have been conserved over evolution, from
bacteria through to the most complex eukaryotes. Many
ribonucleoproteins involved in RNA splicing also appear to be conserved
among the animals. Other enzyme types are also overrepresented
(transferases, oxidoreductases, ligases, lyases, and isomerases). Many
of these enzymes are involved in intermediary metabolism. The only
exception is the hydrolase category, which is not significantly
overrepresented in the shared protein set. Proteases form the largest
part of this category, and several large protease families have
expanded in each of these three organisms after their divergence. The
category of select regulatory molecules is also overrepresented in the
conserved set. The major conserved families are small guanosine
triphosphatases (GTPases) (especially the Ras-related superfamily,
including ADP ribosylation factor) and cell cycle regulators
(particularly the cullin family, cyclin C family, and several cell
division protein kinases). The last two significantly overrepresented
categories are protein transport and trafficking, and chaperones. The
most conserved groups in these categories are proteins involved in
coated vesicle-mediated transport, and chaperones involved in protein
folding and heat-shock response [particularly the DNAJ family, and
heat-shock protein 60 (HSP60), HSP70, and HSP90 families]. These
observations provide only a conservative estimate of the protein
families in the context of specific cellular processes that were likely
derived from the last common ancestor of the human, fly, and worm. As
stated before, this analysis does not provide a complete estimate of
conservation across the three animal genomes, as paralogous duplication
makes the determination of true orthologs difficult within the members
of conserved protein families.
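
The statement that a category is "overrepresented by a factor of ~2 or more" amounts to comparing the category's share of the conserved set with its share of the whole protein set. A minimal sketch of that ratio follows; the function name, category names, and all counts are hypothetical and are not taken from Fig. 15 or Fig. 16.

```python
def fold_enrichment(subset_count, subset_total, whole_count, whole_total):
    """Ratio of a category's share in a subset to its share in the whole set."""
    subset_share = subset_count / subset_total
    whole_share = whole_count / whole_total
    return subset_share / whole_share


# Hypothetical counts for illustration only; not values from the figures.
whole_total = 10_000
subset_total = 1_000
categories = {
    # name: (count in conserved subset, count in whole set)
    "category_a": (90, 400),
    "category_b": (40, 410),
}

for name, (subset_count, whole_count) in categories.items():
    ratio = fold_enrichment(subset_count, subset_total, whole_count, whole_total)
    flag = "overrepresented" if ratio >= 2 else "not overrepresented"
    print(f"{name}: {ratio:.2f}x ({flag})")
```

With these made-up numbers, `category_a` holds 9% of the subset against 4% of the whole set and is flagged, whereas `category_b` has nearly the same share in both and is not.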

7.3 Differences between the human
genome and other sequenced 
eukaryotic genomes 
To explore the molecular building blocks of 
the vertebrate taxon, we have compared the 
human genome with the other sequenced 
eukaryotic genomes at three levels: molec- 
ular functions, protein families, and protein 
domains. 

Molecular differences can be correlated 
with phenotypic differences to begin to reveal 
the developmental and cellular processes that 
are unique to the vertebrates. Tables 18 and
19 display a comparison among all sequenced
eukaryotic genomes, over selected protein/domain
families (defined by sequence similarity,
e.g., the serine-threonine protein kinases)
and superfamilies (defined by shared
molecular function, which may include several
sequence-related families, e.g., the cytokines).
In these tables we have focused on
(super) families that are either very large or
that differ significantly in humans compared
with the other sequenced eukaryote genomes.
We have found that the most prominent human
expansions are in proteins involved in (i)
acquired immune functions; (ii) neural development,
structure, and functions; (iii) intercellular
and intracellular signaling pathways
in development and homeostasis; (iv) hemostasis;
and (v) apoptosis.

Acquired immunity. One of the most striking differences between the
human genome and the Drosophila or C. elegans genome is the appearance
of genes involved in acquired immunity (Tables 18 and 19). This is
expected, because the acquired immune response is a defense system that
only occurs in vertebrates. We observe 22 class I and 22 class II major
histocompatibility complex (MHC) antigen genes and 114 other
immunoglobulin genes in the human genome. In addition, there are 59
genes in the cognate immunoglobulin receptor family. At the domain
level, this is exemplified by an expansion and recruitment of the
ancient immunoglobulin fold to constitute molecules such as MHC, and of
the integrin fold to form several of the cell adhesion molecules that
mediate interactions between immune effector cells and the
extracellular matrix. Vertebrate-specific proteins include the
paracrine immune regulators family of secreted 4-alpha helical bundle
proteins, namely the cytokines and chemokines. Some of the cytoplasmic
signal transduction components associated with cytokine receptor signal
transduction are also features that are poorly represented in the fly
and worm. These include protein domains found in the signal transducer
and activator of transcription (STATs), the suppressors of cytokine
signaling (SOCS), and protein inhibitors of activated STATs (PIAS). In
contrast, many of the animal-specific protein domains that play a role
in innate immune response, such as the Toll receptors, do not appear to
be significantly expanded in the human genome.

Neural development, structure, and function. In the human genome, as
compared with the worm and fly genomes, there is a marked increase in
the number of members of protein families that are involved in neural
development. Examples include neurotrophic factors such as ependymin,
nerve growth factor, and signaling molecules such as semaphorins, as
well as the number of proteins involved directly in neural structure
and function such as myelin proteins, voltage-gated ion channels, and
synaptic proteins such as synaptotagmin. These observations correlate
well with the known phenotypic differences between the nervous systems
of these taxa, notably (i) the increase in the number and connectivity
of neurons; (ii) the increase in number of distinct neural cell types
(as many as a thousand or more in human compared with a few hundred in
fly and worm) (121); (iii) the increased length of individual axons;
and (iv) the significant increase in glial cell number, especially the
appearance of myelinating glial cells, which are electrically inert
supporting cells differentiated from the same stem cells as neurons. A
number of prominent protein expansions are involved in the processes of
neural development. Of the extracellular domains that mediate cell
adhesion, the connexin domain-containing proteins (122) exist only in
humans. These proteins, which are not present in the Drosophila or
C. elegans genomes, appear to provide the constitutive subunits of
intercellular channels and the structural basis for electrical
coupling. Pathway finding by axons and neuronal network formation is
mediated through a subset of ephrins and their cognate receptor
tyrosine kinases that act as positional labels to establish
topographical projections (123). The probable biological role for the
semaphorins (22 in human compared with 6 in the fly and 2 in the worm)
and their receptors (neuropilins and plexins) is that of axonal
guidance molecules (124). Signaling molecules such as neurotrophic
factors and some cytokines have been shown to regulate neuronal cell
survival, proliferation, and axon guidance (125). Notch receptors and
ligands play important roles in glial cell fate determination and
gliogenesis (126).

Other human expanded gene families play key roles directly in neural
structure and function. One example is synaptotagmin (expanded more
than twofold in humans relative to the invertebrates), originally found
to regulate synaptic transmission by serving as a Ca2+ sensor (or
receptor) during synaptic vesicle fusion and release (127). Of interest
is the increased co-occurrence in humans of PDZ and the SH3 domains in
neuronal-specific adaptor molecules; examples include proteins that
likely modulate channel activity at synaptic junctions (128). We also
noted expansions in several ion-channel families (Table 19), including
the EAG subfamily (related to cyclic nucleotide gated channels), the
voltage-gated calcium/sodium channel family, the inward-rectifier
potassium channel family, and the voltage-gated potassium channel,
alpha subunit family. Voltage-gated sodium and potassium channels are
involved in the generation of action potentials in neurons. Together
with voltage-gated calcium channels, they also play a key role in
coupling action potentials to neurotransmitter release, in the
development of neurites, and in short-term memory. The recent
observation of a calcium-regulated association between sodium channels
and synaptotagmin may have consequences for the establishment and
regulation of neuronal excitability (129).

Myelin basic protein and myelin-associated glycoprotein are major
classes of protein components in both the central and peripheral
nervous system of vertebrates. Myelin P0 is a major component of
peripheral myelin, and myelin proteolipid and myelin oligodendrocyte
glycoprotein are found in the central nervous system. Mutations in any
of these

Table 18. Domain-based comparative analysis of proteins in H. sapiens
(H), D. melanogaster (F), C. elegans (W), S. cerevisiae (Y), and
A. thaliana (A). The predicted protein set of each of the above
eukaryotic organisms was analyzed with Pfam version 5.5 using E value
cutoffs of 0.001. The number of proteins containing the specified Pfam
domains as well as the total number of domains (in parentheses) are
shown in each column. Domains were categorized into cellular processes
for presentation. Some domains (i.e., SH2) are listed in more than one
cellular process. Results of the Pfam analysis may differ from results
obtained based on human curation of protein families, owing to the
limitations of large-scale automatic classifications. Representative
examples of domains with reduced counts owing to the stringent E value
cutoff used for this analysis are marked with a double asterisk.
Examples include short divergent and predominantly alpha-helical
domains, and certain classes of cysteine-rich zinc finger proteins.
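
The two numbers reported per cell in Table 18 (proteins containing a domain, and total domain occurrences in parentheses) can be derived from a list of domain hits as sketched below. This is an illustration only: the hit tuples are hypothetical rather than real Pfam output, treating the cutoff as "E value less than or equal to 0.001" is an assumption, and printing a single number when both counts coincide is inferred from the table's appearance rather than stated in the caption.

```python
from collections import defaultdict


def domain_counts(hits, e_cutoff=0.001):
    """Return {domain: (proteins_with_domain, total_domain_occurrences)}."""
    proteins = defaultdict(set)
    occurrences = defaultdict(int)
    for protein_id, domain, e_value in hits:
        if e_value <= e_cutoff:
            proteins[domain].add(protein_id)
            occurrences[domain] += 1
    return {d: (len(proteins[d]), occurrences[d]) for d in proteins}


def format_cell(n_proteins, n_domains):
    """Render a cell as 'proteins (domains)', omitting equal totals."""
    if n_domains == n_proteins:
        return str(n_proteins)
    return f"{n_proteins} ({n_domains})"


# Hypothetical hits: (protein identifier, domain name, E value).
hits = [
    ("p1", "Cadherin", 1e-20),
    ("p1", "Cadherin", 1e-15),
    ("p1", "Cadherin", 0.5),    # fails the cutoff and is ignored
    ("p2", "Cadherin", 1e-8),
    ("p3", "Leptin", 1e-30),
]

for domain, (n_prot, n_dom) in sorted(domain_counts(hits).items()):
    print(domain, format_cell(n_prot, n_dom))
```

A stricter `e_cutoff` lowers both counts, which is the effect the caption flags with a double asterisk for short or divergent domains.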



Accession number   Domain name   Domain description   H   F   W   Y   A



PF02039
PF00212
PF00028
PF00214
PF01110
PF01093
PF00029
PF00976
PF00473
PF00007
PF00778
PF00322
PF00812
PF01404
PF00167
PF01534
PF00236
PF01153
PF01271
PF02058
PF00049
PF00219
PF02024
PF00193
PF00243
PF02158
PF00184
PF02070
PF00066
PF00865
PF00159
PF01279
PF00123
PF00341
PF01403
PF01033
PF00103
PF02208
PF02404
PF01034
PF00020
PF00019
PF01099
PF01160
PF00110

PF01821
PF00386
PF00200
PF00754
PF01410
PF00039
PF00040
PF00051
PF01823
PF00354
PF00277
PF00084
PF02210
PF01108
PF00868
PF00927



Adrenomedullin
ANP
Cadherin
Calc_CGRP_IAPP
CNTF
Clusterin
Connexin
ACTH_domain
CRF
Cys_knot
DIX
Endothelin
Ephrin
EPh_lbd
FGF
Frizzled
Hormone6
Glypican
Granin
Guanylin
Insulin
IGFBP
Leptin
Xlink
NGF
Neuregulin
Hormone5
Notch
Osteopontin
Hormone3
Parathyroid
Hormone2
PDGF
Sema
Somatomedin_B
Hormone
Sorb
SCF
Syndecan
TNFR_c6
TGF-β
Uteroglobin
Opiods_neuropep
Wnt

ANATO
C1q
Disintegrin
F5_F8_type_C
COLFI
Fn1
Fn2
Kringle
MACPF
Pentaxin
SAA_proteins
Sushi
TSPN
Tissue_fac
Transglutamin_N
Transglutamin_C



Developmental and homeostatic regulators

Adrenomedullin
Atrial natriuretic peptide
Cadherin domain
Calcitonin/CGRP/IAPP family
Ciliary neurotrophic factor
Clusterin
Connexin
Corticotropin ACTH domain
Corticotropin-releasing factor family
Cystine-knot domain
DIX domain
Endothelin family
Ephrin
Ephrin receptor ligand binding domain
Fibroblast growth factor
Frizzled/Smoothened family membrane region
Glycoprotein hormones
Glypican
Granin (chromogranin or secretogranin)
Guanylin precursor
Insulin/IGF/Relaxin family
Insulin-like growth factor binding proteins
Leptin
LINK (hyaluron binding)
Nerve growth factor family
Neuregulin family
Neurohypophysial hormones
Neuromedin U
Notch (DSL) domain
Osteopontin
Pancreatic hormone peptides
Parathyroid hormone family
Peptide hormone
Platelet-derived growth factor (PDGF)
Sema domain
Somatomedin B domain
Somatotropin
Sorbin homologous domain
Stem cell factor
Syndecan domain
TNFR/NGFR cysteine-rich region
Transforming growth factor β-like domain
Uteroglobin family
Vertebrate endogenous opioids neuropeptide
Wnt family of developmental signaling proteins

Hemostasis

Anaphylotoxin-like domain
C1q domain
Disintegrin
F5/8 type C domain
Fibrillar collagen C-terminal domain
Fibronectin type I domain
Fibronectin type II domain
Kringle domain
MAC/Perforin domain
Pentaxin family
Serum amyloid A protein
Sushi domain (SCR repeat)
Thrombospondin N-terminal-like domains
Tissue factor
Transglutaminase family
Transglutaminase family

1 
2 

100 (550) 
3 
1 
3 

. . 14(16) 
1 
2 

10(11) 
5 

7(8) 
12 
23 
9 
1 

14 
3 
1 
7 

10 
1 

13(23) 
3 
4 
1 
1 

3(5) 
1 
3 
2 

5(9) 
5 

27(29) 
5(8) 
1 
2 
2 

17(31) 
27(28) 
3 
3 
18 



0 
0 

14(157) 
0 
0 
0 
0 
0 
1 
2 
2 
0 
2 
2 
1 
7 
0 
2 
0 
0 
4 
0 
0 
0 
0 
0 

/ 0 
0 

2(4) 
0 
0 
0 
0 

1 

8(10) 
3 
0 
0 
0 
1 
1 
6 
0 
0 

7(10) 



0 
0 

16(66) 
0 
0 
0 
0 
0 
0 
0 
4 
0 
4 
1 
1 
3 
0 
1 
0 
0 
0 
0 
0 

1 

0 
0 
0 
0 

2(6) 
0 
0 
0 
0 
0 

3(4) 
0 
0 
0 
0 

1 

0 
4 
0 

d 

5 



6(14) 


0 


0 


24 


0 


0 


18 




3 


15(20) 


5(6) 


2 


.10 


0 


0 


5(18) : 


0 


0 


11(16) 


0 


0 


15(24) 


2 


2 


6 


0 


0 


9 


0 


0 


4 


0 


0 


53(191) 


11(42) 


8(45) 


14 


1 


0 


1 


0 


0 


6 


1 


0 


8 


1 


0 



0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

0^ 
0 



0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
. 0 
0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 



0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
' 0 
0 
0 
0 
. 0 

6 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
■ 0 
0 
0 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 



PF00594

Gla

PF00711
PF00748
PF00666
PF00129
PF00993
PF00969
PF00879
PF01109
PF00047
PF00143
PF00714
PF00726
PF02372
PF00715
PF00727
PF02025
PF01415
PF00340
PF02394
PF02059
PF00489
PF01291
PF00323
PF01091
PF00277
PF00048
PF01582
PF00229
PF00088

PF00779
PF00168
PF00609
PF00781
PF00610
PF01363
PF00996
PF00503
PF00631
PF00616
PF00618
PF00625
PF02189
PF00169
PF00130
PF00388
PF00387
PF00640
PF02192
PF00794
PF01412
PF02196
PF02145
PF00788
PF00071
PF00617
PF00615
PF02197



Defensin_beta
Calpain_inhib
Cathelicidins
MHC_I
MHC_II_alpha**
MHC_II_beta**
Defensin_propep
GM_CSF
Ig
Interferon
IFN-gamma
IL10
IL15
IL2
IL4
IL5
IL7
IL1
IL1_propep
IL3
IL6
LIF_OSM
Defensins
PTN_MK
SAA_proteins
IL8
TIR
TNF
Trefoil

BTK
C2
DAGKa
DAGKc
DEP
FYVE
GDI
G-alpha
G-gamma
RasGAP
RasGEFN
Guanylate_kin
ITAM
PH
DAG_PE-bind
PI-PLC-X
PI-PLC-Y
PID
PI3K_p85B
PI3K_rbd
ArfGAP
RBD
Rap_GAP
RA
Ras
RasGEF
RGS
RIIa



Vitamin K-dependent carboxylation/gamma-carboxyglutamic (GLA) domain

Immune response

Beta defensin
Calpain inhibitor repeat
Cathelicidins
Class I histocompatibility antigen, domains alpha 1 and 2
Class II histocompatibility antigen, alpha domain
Class II histocompatibility antigen, beta domain
Defensin propeptide
Granulocyte-macrophage colony-stimulating factor
Immunoglobulin domain
Interferon alpha/beta domain
Interferon gamma
Interleukin-10
Interleukin-15
Interleukin-2
Interleukin-4
Interleukin-5
Interleukin-7/9 family
Interleukin-1
Interleukin-1 propeptide
Interleukin-3
Interleukin-6/G-CSF/MGF family
Leukemia inhibitory factor (LIF)/oncostatin (OSM) family
Mammalian defensin
PTN/MK heparin-binding protein
Serum amyloid A protein
Small cytokines (intercrine/chemokine), interleukin-8 like
TIR domain
TNF (tumor necrosis factor) family
Trefoil (P-type) domain

PI-PY-rho GTPase signaling

BTK motif
C2 domain
Diacylglycerol kinase accessory domain (presumed)
Diacylglycerol kinase catalytic domain (presumed)
Domain found in Dishevelled, Egl-10, and Pleckstrin (DEP)
FYVE zinc finger
GDP dissociation inhibitor
G-protein alpha subunit
G-protein gamma like domains
GTPase-activator protein for Ras-like GTPase
Guanine nucleotide exchange factor for Ras-like GTPases; N-terminal motif
Guanylate kinase
Immunoreceptor tyrosine-based activation motif
PH domain
Phorbol esters/diacylglycerol binding domain (C1 domain)
Phosphatidylinositol-specific phospholipase C, X domain
Phosphatidylinositol-specific phospholipase C, Y domain
Phosphotyrosine interaction domain (PTB/PID)
PI3-kinase family, p85-binding domain
PI3-kinase family, ras-binding domain
Putative GTP-ase activating protein for Arf
Raf-like Ras-binding domain
Rap/ran-GAP
Ras association (RalGDS/AF-6) domain
Ras family
RasGEF domain
Regulator of G protein signaling domain
Regulatory subunit of type II PKA R-subunit



11 



. 3(9) 

: ■ ■ - 2 

18(20) 

5(6) 
7 
3 
1 

381 (930J 
7(9) 
1 
1 
1 
1 
1 
1 
1 
7 
1 
1 
2 
2 

2 
2 
4 
32 

18 
12 
5(6) 

5 

73(101) 
9 
10 
12(13) 

28 (30) 
6 

27(30) 
16 
11 
9 

12 

3 

193 (212) 
45(56) 

12 

11 

24(27) 
2 
6 
16 
6(7) 
5 

18(19) 
126 
21 
27 
4 



0 
0 
0 

: 0 

0 
0 
0 
0 

125 (291) 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 



13 
1 
3 
9 
4 
4 

7(9) 
56(57) 
8 

6(7) 
1 



0 
0 
0 

. * 0 

0 
0 
0 
0 

67(323) 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 



0 
0 

. 0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 



11(12) 

1 
1 

8 

1 

2 
6 
51 
7 

12(13) 
2 



0 
0 
0 
6 
0 
0 
1 
23 
5 
1 
1 



0 
0 

. 0 
"0. 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
0 
0 
0 
0 
0 
0 
0 



0 


0 


0 


0 


0 


0 


0 


0 


0 


0 


0 


0 


0 


0 


0 


0 


8 


2 


b 


131 (143) 


0 


0 


0 


. 0 


0 


2 


0 


0 


1 


0 


0 


0 


32(44) 


24(35) 


6(9) 


66 (90) 


4 


7 


0 


6 


8 


8 


2 


11(12) 


4 


10 


5 


2 


14 


15 


5 


15 


2 


1 


1 


3 


10 


20(23) 


2 


5 


5 


5 


1 


0 


5 


8 


3 


0 


2 


3 


5 


0 


8 


7 


1 


4 


0 


0 


0 


0 


72(78) 


65 (68) 


24 


23 


25(31) 


26(40) 


1(2) 


4 


3 


7 


1 


8 


2 


7 


1 


8 



0 
0 
0 

15 
0 
0 
0 

78 
0 
0 
0 



PF00620
PF00621
PF00536
PF01369
PF00017
PF00018
PF01017
PF00790
PF00568

PF00452
PF02180
PF00619
PF00531
PF01335
PF02179
PF00656
PF00653

PF00022
PF00191
PF00402
PF00373
PF00880
PF00681
PF00435
PF00418
PF00992
PF02209
PF01044

PF01391
PF01413
PF00431
PF00008
PF00147
PF00041
PF00757
PF00357
PF00362
PF00052
PF00053
PF00054
PF00055
PF00059
PF01463
PF01462
PF00057
PF00058
PF00530
PF00084
PF00090
PF00092
PF00093
PF00094

PF00244
PF00023
PF00514
PF00168
PF00027
PF01556
PF00226
PF00036
PF00611
PF01846
PF00498



RhoGAP
RhoGEF
SAM
Sec7
SH2
SH3
STAT
VHS
WH1

Bcl-2
BH4
CARD
Death
DED
BAG
ICE_p20
BIR

Actin
Annexin
Calponin
Band_41
Nebulin_repeat
Plectin_repeat
Spectrin
Tubulin-binding
Troponin
VHP
Vinculin

Collagen
C4
CUB
EGF
Fibrinogen_C
Fn3
Furin-like
Integrin_A
Integrin_B
Laminin_B
Laminin_EGF
Laminin_G
Laminin_Nterm
Lectin_c
LRRCT
LRRNT
Ldl_recept_a
Ldl_recept_b
SRCR
Sushi
Tsp_1
Vwa
Vwc
Vwd

14-3-3
Ank
Armadillo_seg
C2
cNMP_binding
DnaJ_C
DnaJ
Efhand**
FCH
FF
FHA



RhoCAP domain 
RhoCEF domain 

SAM domain (Sterile alpha motif) 
Sec7 domain 

Src homology 2 (SH2) domain 
Src homology 3 (SH3) domain 
STAT protein 
VHS domain 
WHl domain 

Domains involved in apoptosis 
Bcl-2 - 

Bcl-2 homology region 4 
. Caspase recruitment domain 
Death domain 
Death effector domain 
Domain present in Hsp70 regulators 
ICE-like protease (caspase) p20 domain 
Inhibitor of Apoptosis domain 

. Cytoskeletat 

Actin 
Annexin 
Calponin family 

FERM domain (Band 4.1 family) 
• Nebulin repeat 
Plectin repeat 
Spectrin repeat 

Tau and MAP proteins, tubulin-binding 
Troponin 

Villin headpiece domain 
Vinculin family 

fCM adhesion 
Collagen triple helix repeat (20 copies) 
C-terminal tandem repeated domain in type 4 

procollagen 
CUB domain 
ECF-like domain 

Fibrinogen beta and gamma chains, C-termtnal 

globular domain 
Fibronectin type Hi domain 
Furin-like cysteine rich region 
Integrin alpha cytoplasmic region 
Integrins, beta chain 
Laminin B (Domain IV) 
Lamintn EGF-Uke (Domains lll'and V) 
Laminin C domain 
Lamintn N-terminal (Domain VI) 
Lectin C-type domain 
Leucine rich repeat C-terminal domain 
Leucine rich repeat N-terminal domain 
Low-density lipoprotein receptor domain class A 
Low-density lipoprotein receptor repeat class B 
Scavenger receptor cysteine-rich domain 
Sushi domain (SCR repeat) 
Thrombospondin type 1 domain 
von Willebrand factor type A domain 
von Willebrand factor type C domain 
von Willebrand factor type D domain 

Protein interaction domains 

14-3-3 proteins 
Ank repeat 

Armadillo/beta-catenin-like repeats 
C2 domain

Cyclic nucleotide-binding domain 
DnaJ C terminal region 
DnaJ domain 
EF hand 

Fes/CIP4 homology domain 
FF domain 
FHA domain 



59 
46 
29(31) 
13 

87(95) 
143(182) 
7 
4 
7 

9 

. 3 
16 
16 
4(5) 
5(8) 
11 
8(14) 

61 (64) 
16(55) 
13(22) 
29(30) 
4(148) 

2(11) 
31 (195) 
4(12) 
4 
5 
4 

65 (279) 
6(11) 

47(69) 
108 (420) 
26 



106(545) 
5 
3 
8 

8(12) 
24(126) 
30(57) 
10 

47(76) 
69(81) 
40(44) 
35(127) 
15(96) 
11(46) 
53(191) 
41 (66) 
34(58) 
19(28) 
15(35) 



20 

145(404) 
22(56) 
73(101) 
26(31) 
12 
44 

83(151) 
9 

4(11) 
13 



19 
23(24) 
15 
5 

33(39) 
55(75) 
1 
2 

2 • 

2 
0 
0 
5 
0 
3 
7 

5(9) 

15(16) 
4(16) 
3 

17(19) 

1(2) 
0 

13(171) 
1(4) 
6 
2 
2 

10(46) 
"2(4) 

9(47) 
45(186) 
10(11) 

42(168) 
2 
1 
2 

4(7) 
9(62) 
18(42) 
6 

23(24) 
23 (30) 
7(13) 
33(152) 
9(56) 
4(8) 
11(42) 
11(23) 

0 . 
6(11) 
3(7) 

3 

72(269) 
11(38) 
32 (44) 
21(33) 
9 
34 

64(117) 
3 

4(10) 
15 



20 
18(19) 
8 
5 

44(48) 
46(61) 

1(2) 
4 

.2(3) 

1 
1 
2 
7 
0 
2 
3 

2(3) 

12 
4(11) 
7(19) 
11(14) 
1 
0 

10(93) 
2(8) 
8 
2 
1 



9 
3 
3 
5 

23(27) 
0 
4 
1 

0 
0 
0 
0 
0 
1 
0 

1(2) 

nil) 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 



8 
0 
6 
9 
3 
4 
0 
8 
0 

0 
.0 
0 
0 
0 
5 
0 
0 



24 
6(16) 
0 
0 
0 
0 
0 
0 
0 
5 

0. 



174(384) 


. 0 


0 


3(6) 


0 


0 


43(67) 


0 


0 


54(157) 


0 


1 


6 


0 


0 


34(156) 


0 


1 


1 


0 


0 


2 


0 


0 


2 


0 


0 


6(10) 


0 


0 


11(65) 


0 


0 


14(26) • 


0 


0 


4 


0 


0 


91 (132) 


0 


0 


7(9) 


0 


0 


3(6) 


0 


0 


27(113) 


0 


0 


7(22) 


0 


0 


1(2) 


0 


0 


8(45) 


0 


0 


18 (47) 


0 


0 


17(19) 


0 


1 


2(5) 


0 


0 


9 




0 


3 


2 


15 


75(223) 


12(20) 


66(111) 


3(11) 


2(10) 


25(67) 


24(35) 


6(9) 


66(90) 


15(20) 


2(3) 


22 


5 


3 


19 


33 


20 


93 


41 (86) 


4(11) 


120(328) 


2 


4 


0 


3(16) 


2(5) 


4(8) 


7 


13(14) 


17 
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myelin proteins result in severe demyelina- 
tion, which is a pathological condition in 
which the myelin is lost and the nerve
conduction is severely impaired (130). Humans
have at least 10 genes belonging to four
different families involved in myelin
production (five myelin P0, three myelin
proteolipid, myelin basic protein, and
myelin-oligodendrocyte glycoprotein, or MOG),
and possibly more-remotely related members of the
MOG family. Flies have only a single myelin
proteolipid, and worms have none at all.



Intercellular and intracellular signaling 
pathways in development and homeostasis. 
Many protein families that have expanded in 
humans relative to the invertebrates are in- 
volved in signaling processes, particularly in 
response to development and differentiation 



Table 18 (Continued)

Accession number

Domain name

Domain description



H 



W 



Y 



PF00254 
PF01590 
PF01344 
PF00560
PF00917 
PF00989 
PF00595
PF00169 
PF01535 
PF00536 
PF01369 
PF00017 
PF00018 
PF01740 
PF00515
PF00400 
PF00397 
PF00569 

PF01754 
PF01388 
PF01426
PF00643
PF00533
PF00439
PF00651
PF00145 
PF00385 



PF00125 
PF00134 
PF00270 
PF01529 
PF00646 
PF00250 
PF00320 
PF01585 
PF00010 
PF00850 
PF00046 
PF01833 
PF02373 
PF02375 
PF00013 
PF01352 
PF00104 

PF00412 
PF00917 
PF00249 
PF02344 
PF01753 
PF00628 
PF00157 
PF02257 
PF00076 



^£021 

^FOU 



PF02037
PF00622
PF01852
PF00907 



FKBP 

GAF 

Kelch 

LRR** 

MATH 

PAS 

PDZ 

PH 

PPR** 

SAM 

Sec7 

SH2 

SH3 

STAS 

TPR** 

WD40**

WW

ZZ

Zf-A20

ARID

BAH

Zf-B_box**
BRCT
Bromodomain 
BTB 

DNA_methylase 
Chromo 

Histone 

Cyclin 

DEAD 

Zf-DHHC 

F-box** 

Fork_head

GATA

G-patch

HLH** 

Hist_deacetyl 

Homeobox 

TIG 

JmjC 

JmjN 

KH-domain 
KRAB 

Hormone_rec 

LIM
MATH

Myb_DNA-binding

Myc-LZ

Zf-MYND

PHD

Pou

RFX_DNA_binding
Rrm 

SAP 
SPRY 
START 
T-box 



FKBP-type peptidyl-prolyl cis-trans isomerases

GAF domain

Kelch motif 

Leucine Rich Repeat 

MATH domain 

PAS domain 

PDZ domain (Also known as DHR or GLGF)
PH domain 
PPR repeat 

SAM domain (Sterile alpha motif) 
Sec7 domain 

Src homology 2 (SH2) domain 
Src homology 3 (SH3) domain 
STAS domain 
TPR domain 
WD40 domain 
WW domain 

ZZ-Zinc finger present in dystrophin, CBP/p300

Nuclear interaction domains

A20-like zinc finger
ARID DNA binding domain
BAH domain
B-box zinc finger

BRCA1 C Terminus (BRCT) domain
Bromodomain
BTB/POZ domain

C-5 cytosine-specific DNA methylase
'chromo' (CHRromatin Organization MOdifier) domain

Core histone H2A/H2B/H3/H4
Cyclin 

DEAD/DEAH box helicase
DHHC zinc finger domain
F-box domain
Fork head domain
GATA zinc finger
G-patch domain

Helix-loop-helix DNA-binding domain
Histone deacetylase family
Homeobox domain
IPT/TIG domain
JmjC domain 
JmjN domain 
KH domain 
KRAB box 

Ligand-binding domain of nuclear hormone receptor
LIM domain containing proteins
MATH domain

Myb-like DNA-binding domain
Myc leucine zipper domain
MYND finger
PHD-finger

Pou domain, N-terminal to homeobox domain
RFX DNA-binding domain
RNA recognition motif (a.k.a. RRM, RBD, or RNP domain)
SAP domain 
SPRY domain 
START domain 
T-box 



15(20) 
7(8) 
54(157) 
25(30) 
11 

18(19) 
96(154) 
193 (212) 
5 

29(31) 
13 

87(95) 
143(182) 
5 

72(131) 
136 (305) 
32(53) 
10(11) 



7(8) 
2(4) 
12(48) 
24(30) 
5 

9(10) 
60(87) 
72(78) 
3(4) 
15 
5 

33(39) 
55(75) 
1 

39(101) 
98(226) 
24(39) 
13 



2(8) 


2 


n 




8(10) 


7(8) 


32 (35) 


1 


17(28) 


10(18) 




16(22) 


97(98) 


62 (64) 


3(4) 


1 


24(27) 


14(15) 


75(81) 


5 


19 


10 


63 (66) 


48(50) 


15 


20 


16 


15 


35 (36) 


20(211 


11(17) 


5(6) 


18 


16 


60(61) 


44 


12 


5(6) 


160(178) 


100(103) 


29(53) 


11(13) 


10 


4 


7 


4 


28(67) 


14(32) 


204 (243) 


0 


47 


17 


62 (129) 


33(83) 


11 


5 


32(43) 


18(24) 


1 


0 


14 


14 


68 (86) 


40(53) 


15 


5 


7 


2 


224(324) 


127(199) 


15 


8 


44(51) 


10(12) 


10 


2 


17(19) 


8 



7(13) 


4 


24 (29) 


1 


0 


10 


13(41) 


3 


102 (178) 


7(11) 


1 


15(16) 


88 (161) 


1 


61 (74) 


6 


1 


13(18) 


46(66) 


2 


5 


65 (68) 


24 


23 


0 


1 


474 (2485) 


8 


3 


6 


5 


5 


9 


44(48) 




3 


46(61) 


23(27) 


4 


6 


2 


13 


28(54) 


16(31) 


65(124) 


72 (153) 


56(121) 


167(344) 


16(24) 


5(8) 


11(15) 


10 


2 


10 



2 
4 

. 4(5) 

23(35) 
18(26) 
86(91) 

^ 17(18) 

71 (73) 
10 

55(57) 
16 

309(324) 
15 
8(10) 
13 
24 
8(10) 
82 (84) 
5(7) 
6 
2 

17(46) 
0 

142(147) 

33(79) 
88 (161) 

17(24) 
0 
9 

32(44) 
4 
1 

94(145) 
5 

5(7) 
6 
22 



0 
2 
5 
0 

10(16) 
10(15) 

. 1(2) 
0 

1(2) 

8 
11 

50(52) 
7 
9 
4 
9 
4 
4 
5 
6 
2 
4 
3 

4(14) 
0 
0 

4(7) 
1 

15(20) 
0 
1 

14(15) 
0 
1 

43(73) 

5 
3 
0 
0 



8 
7 

21 (25) 
0 

12(16) 
28 
30(31) 
13(15) 
12 

48 
35 
84(87) 
22 

165 (167) 
0 
26 * 
14(15) 
39 
10 
66 
1 
7 
7 

27(61) 
0 
0 

10(16) 
61 (74) 
243 (401) 
0 
7 

96 (105) 
0 
0 

232 (369) 

6(7) 
6 
23 
0 
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Table 18 {Continued) 
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Accession 
number 

PF02135 
PF01285 
PF02176 
PF00352 

PF00567
PF00642
PF00096
PF00097 
PF00098 



Domain name 

Zf-TAZ 
TEA 

Zf-TRAF
TBP

TUDOR
Zf-CCCH
Zf-C2H2**
Zf-C3HC4
Zf-CCHC



Domain description 

TAZ finger 
TEA domain 
TRAF-type zinc finger 
Transcription factor TFIID (or TATA-binding protein, TBP)
TUDOR domain

Zinc finger C-x8-C-x5-C-x3-H type (and similar)

Zinc finger, C2H2 type

Zinc finger, C3HC4 type (RING finger) 

Zinc knuckle 



H 

2(3) 
4 

6(9) 
2(4) 

9(24) 
17(22) 
564(4500) 
135(137) 
9(17) 



1(2) 
1 

1(3) 
4(8) 



9(19) 
6(8) 
234(771) 
57 
6(10) 



W


Y 




0 


1 


1 


1 


0 


2(4) 


1(2) 


4(5) 


0 


22(42) 


3(5) 


68(155) 


34(56) 


88(89) 


18 


17(33) 


7(13) 



10(15) 
0 
2 

2(4) 
2 

31(46) 
21 (24) 
298 (304) 
68 (91) 



(Tables 18 and 19). They include secreted
hormones and growth factors, receptors,
intracellular signaling molecules, and
transcription factors.

Developmental signaling molecules that are
enriched in the human genome include growth
factors such as wnt, transforming growth
factor-β (TGF-β), fibroblast growth factor (FGF),
nerve growth factor, platelet derived growth
factor (PDGF), and ephrins. These growth
factors affect tissue differentiation and a wide
range of cellular processes involving
actin-cytoskeletal and nuclear regulation. The
corresponding receptors of these developmental
ligands are also expanded in humans. For
example, our analysis suggests at least 8 human
ephrin genes (2 in the fly, 4 in the worm) and 12
ephrin receptors (2 in the fly, 1 in the worm). In
the wnt signaling pathway, we find 18 wnt
family genes (6 in the fly, 5 in the worm) and
12 frizzled receptors (6 in the fly, 5 in the
worm). The Groucho family of transcriptional
corepressors downstream in the wnt pathway
is even more markedly expanded, with 13
predicted members in humans (2 in the fly, 1 in
the worm).
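
The family counts quoted above can be restated as fold-expansion ratios. The following is a minimal sketch in Python that uses only the counts given in the text; the dictionary layout and the function names are illustrative assumptions, not part of the original analysis, and the ephrin figure is a lower bound ("at least 8").

```python
# Gene-family counts quoted in the text, as (human, fly, worm).
# The human ephrin count is stated as "at least 8", so its ratios are lower bounds.
FAMILY_COUNTS = {
    "ephrin": (8, 2, 4),
    "ephrin receptor": (12, 2, 1),
    "wnt": (18, 6, 5),
    "frizzled receptor": (12, 6, 5),
    "Groucho": (13, 2, 1),
}


def fold_expansion(human, other):
    """Return the human count divided by the count in another organism."""
    if other == 0:
        raise ValueError("fold expansion is undefined when the other count is 0")
    return human / other


def expansion_table(counts):
    """Map each family to its (human/fly, human/worm) fold expansions."""
    return {
        family: (fold_expansion(h, f), fold_expansion(h, w))
        for family, (h, f, w) in counts.items()
    }


if __name__ == "__main__":
    for family, (vs_fly, vs_worm) in expansion_table(FAMILY_COUNTS).items():
        print(f"{family}: {vs_fly:.1f}x vs fly, {vs_worm:.1f}x vs worm")
```

On these numbers the Groucho family shows the largest ratios (6.5-fold relative to the fly and 13-fold relative to the worm), in line with the statement that it is the most markedly expanded of the families listed.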

Extracellular adhesion molecules involved
in signaling are expanded in the human genome
(Tables 18 and 19). The interactions of several
of these adhesion domains with extracellular
matrix proteoglycans play a critical role in host
defense, morphogenesis, and tissue repair
(131). Consistent with the well-defined role of
heparan sulfate proteoglycans in modulating
these interactions (132), we observe an
expansion of the heparin sulfate sulfotransferases in
the human genome relative to worm and fly.
These sulfotransferases modulate tissue
differentiation (133). A similar expansion in humans
is noted in structural proteins that constitute the
actin-cytoskeletal architecture. Compared with
the fly and worm, we observe an explosive
expansion of the nebulin (35 domains per
protein on average), aggrecan (12 domains per
protein on average), and plectin (5 domains per
protein on average) repeats in humans. These
repeats are present in proteins involved in
modulating the actin-cytoskeleton with predominant
expression in neuronal, muscle, and vascular
tissues.



Comparison across the five sequenced
eukaryotic organisms revealed several expanded
protein families and domains involved in
cytoplasmic signal transduction (Table 18).
In particular, signal transduction pathways
playing roles in developmental regulation and
acquired immunity were substantially
enriched. There is a factor of 2 or greater
expansion in humans in the Ras superfamily
GTPases and the GTPase activator and GTP
exchange factors associated with them.
Although there are about the same number of
tyrosine kinases in the human and C. elegans
genomes, in humans there is an increase in
the SH2, PTB, and ITAM domains involved
in phosphotyrosine signal transduction.
Further, there is a twofold expansion of
phosphodiesterases in the human genome
compared with either the worm or fly genomes.

The downstream effectors of the intracellular
signaling molecules include the transcription
factors that transduce developmental fates.
Significant expansions are noted in the
ligand-binding nuclear hormone receptor class of
transcription factors compared with the fly genome,
although not to the extent observed in the worm
(Tables 18 and 19). Perhaps the most striking
expansion in humans is in the C2H2 zinc finger
transcription factors. Pfam detects a total of
4500 C2H2 zinc finger domains in 564 human
proteins, compared with 771 in 234 fly proteins.
This means that there has been a dramatic
expansion not only in the number of C2H2
transcription factors, but also in the number of
these DNA-binding motifs per transcription
factor (8 on average in humans, 3.3 on average
in the fly, and 2.3 on average in the worm).
Furthermore, many of these transcription
factors contain either the KRAB or SCAN
domains, which are not found in the fly or worm
genomes. These domains are involved in the
oligomerization of transcription factors and
increase the combinatorial partnering of these
factors. In general, most of the transcription
factor domains are shared between the three
animal genomes, but the reassortment of these
domains results in organism-specific
transcription factor families. The domain combinations
found in the human, fly, and worm include the
BTB with C2H2 in the fly and humans, and
homeodomains alone or in combination with
Pou and LIM domains in all of the animal
genomes. In plants, however, a different set of
transcription factors are expanded, namely, the
myb family, and a unique set that includes VP1
and AP2 domain-containing proteins (134).
The yeast genome has a paucity of transcription
factors compared with the multicellular
eukaryotes, and its repertoire is limited to the
expansion of the yeast-specific C6 transcription
factor family involved in metabolic regulation.
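
The per-protein averages quoted above follow directly from the Pfam counts. Below is a minimal sketch in Python, assuming Table 18 entries take the form `proteins(domains)` (for example `564(4500)`) and that a bare number means the domain count equals the protein count; that reading of the notation and the function names are assumptions made for illustration. The worm entry `68(155)` is taken from what appears to be the C2H2 row of Table 18.

```python
import re

# One Table 18 cell: a protein count, optionally followed by a domain count
# in parentheses, e.g. "564(4500)", "61 (64)", or just "57".
ENTRY = re.compile(r"^\s*(\d+)\s*(?:\(\s*(\d+)\s*\))?\s*$")


def parse_entry(text):
    """Return (proteins, domains) for one table cell."""
    match = ENTRY.match(text)
    if match is None:
        raise ValueError(f"unrecognized table entry: {text!r}")
    proteins = int(match.group(1))
    # Assumption: a missing parenthesized value means one domain per protein.
    domains = int(match.group(2)) if match.group(2) else proteins
    return proteins, domains


def domains_per_protein(text):
    """Average number of domains per domain-containing protein."""
    proteins, domains = parse_entry(text)
    if proteins == 0:
        return 0.0
    return domains / proteins


if __name__ == "__main__":
    for organism, cell in [("human", "564(4500)"), ("fly", "234(771)"), ("worm", "68(155)")]:
        print(f"{organism}: {domains_per_protein(cell):.1f} C2H2 domains per protein")
```

Rounded to one decimal place, these ratios reproduce the averages of 8, 3.3, and 2.3 given in the text.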

While we have illustrated expansions in a 
subset of signal transduction molecules in the 
human genome compared with the other eu- 
karyotic genomes, it should be noted that 
most of the protein domains are highly con- 
served. An interesting observation is that 
worms and humans have approximately the 
same number of both tyrosine kinases and 
serine/threonine kinases (Table 19). It is im- 
portant to note, however, that these are mere- 
ly counts of the catalytic domain; the proteins 
that contain these domains also display a 
wide repertoire of interaction domains with
significant combinatorial diversity.

Hemostasis. Hemostasis is regulated primarily
by plasma proteases of the coagulation
pathway and by the interactions that occur
between the vascular endothelium and platelets.
Consistent with known anatomical and
physiological differences between vertebrates and
invertebrates, extracellular adhesion domains that
constitute proteins integral to hemostasis are
expanded in the human relative to the fly and
worm (Tables 18 and 19). We note the
evolution of domains such as FIMAC, FN1, FN2,
and C1q that mediate surface interactions
between hematopoietic cells and the vascular
matrix. In addition, there has been extensive
recruitment of more-ancient animal-specific
domains such as VWA, VWC, VWD, kringle,
and FN3 into multidomain proteins that are
involved in hemostatic regulation. Although we
do not find a large expansion in the total
number of serine proteases, this enzymatic domain
has been specifically recruited into several of
these multidomain proteins for proteolytic
regulation in the vascular compartment. These are
represented in plasma proteins that belong to
the kinin and complement pathways. There is a
significant expansion in two families of matrix
metalloproteases: ADAM (a disintegrin and
metalloprotease) and MMPs (matrix
metalloproteases) (Table 19). Proteolysis of
extracellular matrix (ECM) proteins is critical for tissue
development and for tissue degradation in
diseases such as cancer, arthritis, Alzheimer's
disease, and a variety of inflammatory conditions
(135, 136). ADAMs are a family of integral
membrane proteins with a pivotal role in
fibrinogenolysis and modulating interactions
between hematopoietic components and the
vascular matrix components. These proteins
have been shown to cleave matrix proteins,
and even signaling molecules: ADAM-17
converts tumor necrosis factor-α, and
ADAM-10 has been implicated in the Notch
signaling pathway (135). We have identified
19 members of the matrix metalloprotease
family, and a total of 51 members of the
ADAM and ADAM-TS families.

Apoptosis. Evolutionary conservation of 
some of the apoptotic pathway components 
across eukarya is consistent with its central 
role in developmental regulation and as a 
response to pathogens and stress signals. The 
signal transduction pathways involved in
programmed cell death, or apoptosis, are
mediated by interactions between well-characterized
domains that include extracellular
domains, adaptor (protein-protein interaction)
domains, and those found in effector and
regulatory enzymes (137). We enumerated
the protein counts of central adaptor and
effector enzyme domains that are found only in
the apoptotic pathways to provide an estimate
of divergence across eukarya and relative
expansion in the human genome when
compared with the fly and worm (Table 18).
Adaptor domains found in proteins restricted
only to apoptotic regulation such as the DED
domains are vertebrate-specific, whereas
others like BIR, CARD, and Bcl2 are
represented in the fly and worm (although the number
of Bcl2 family members in humans is
significantly expanded). Although plants and yeast
lack the caspases, caspase-like molecules,
namely the para- and meta-caspases, have
been reported in these organisms (138).
Compared with other animal genomes, the human
genome shows an expansion in the adaptor
and effector domain-containing proteins
involved in apoptosis, as well as in the
proteases involved in the cascade such as the
caspase and calpain families.

Expansions of other protein families. 
Metabolic enzymes. There are fewer cyto- 
chrome P450 genes in humans than in either 
the fly or worm. Lipoxygenases (six in hu- 
mans), on the other hand, appear to be specific 
to the vertebrates and plants, whereas the
lipoxygenase-activating proteins (four in humans)
may be vertebrate-specific. Lipoxygenases are
involved in arachidonic acid metabolism, and
they and their activators have been implicated
in diverse human pathology ranging from
allergic responses to cancers. One of the most
surprising human expansions, however, is in
the number of glyceraldehyde-3-phosphate
dehydrogenase (GAPDH) genes (46 in
humans, 3 in the fly, and 4 in the worm). There
is, however, evidence for many
retrotransposed GAPDH pseudogenes (139), which
may account for this apparent expansion.
However, it is interesting that GAPDH, long
known as a conserved enzyme involved in
basic metabolism found across all phyla from
bacteria to humans, has recently been shown
to have other functions. It has a second cat-



Panther family/subfamily*

Neural structure, function, development

Ependymin
Ion channels
Acetylcholine receptor
Amiloride-sensitive/degenerin
CNG/EAG
IRK
ITP/ryanodine
Neurotransmitter-gated
P2X purinoceptor
TASK
Transient receptor
Voltage-gated Ca2+ alpha
Voltage-gated Ca2+ alpha-2
Voltage-gated Ca2+ beta
Voltage-gated Ca2+ gamma
Voltage-gated K+ alpha
Voltage-gated KQT
Voltage-gated Na+
Myelin basic protein
Myelin P0
Myelin proteolipid
Myelin-oligodendrocyte glycoprotein
Neuropilin
Plexin
Semaphorin
Synaptotagmin

Defensin
Cytokine†
GCSF
GMCSF
Intercrine alpha
Intercrine beta
Interferon
Interleukin
Leukemia inhibitory factor
MCSF
Peptidoglycan recognition protein
Pre-B cell enhancing factor
Small inducible cytokine A
SI cytokine
TNF

Cytokine receptor†
Bradykinin/C-C chemokine receptor
Fl cytokine receptor
Interferon receptor
Interleukin receptor
Leukocyte tyrosine kinase receptor
MCSF receptor
TNF receptor
Immunoglobulin receptor†
T-cell receptor alpha chain
T-cell receptor beta chain
T-cell receptor gamma chain
T-cell receptor delta chain
Immunoglobulin FC receptor
Killer cell receptor
Polymeric-immunoglobulin receptor
1 0 



17 
11 
22 
16 
10 
61 
10 
12 
15 
22 
10 
5 
1 

33 

6 
11 

1 



12 
24 
9 
3 
2 
51 
0 
12 
3 
4 
3 
2 
0 
5 
2 
4 
0 



5 


0 


3 


1 


1 


0 


2 


0 


9 


2 


22 


6 


10 


3 


Immune 


response 


3 


0 


86 


14 


1 


0 


1 


0 


15 


0 


5 


0 


8 


0 


26 


1 


1 


0 


1 


0 


2 


13 


1 


0 


14 


0 


2 


0 


9 


0 


62 


1 


7 


0 


2 


0 


3 


0 


32 


0 


3 


0 


1 


0 


3 


0 


59 


0 


16 


0 


15 


0 


1 


0 


1 


0 


8 


0 


16 


0 


4 


0 



56 

27 
9 
3 
4 

59 

0 
48 

3 

8 

2 

2 

0 
11 

3 

4 

0 

0 
0 
0 
0 
0 
2 
3 



0 

1 

0 

0 

0 

0 

0 

1 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 



0 
0 
0 
0 
0 
0 
0 

1 
1 

2 

0 

0 

0 

0 

0 

9 

0 

0 
0 
0 
0 
0 
0 
0 



0 
0 
0 
0 
0 
0 
0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
0 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 



0 
0 
30 
0 
0 
19 
0 
5 
0 
2 
0 
0 

0 • 

0 

0 

1 

0 
0 
0 
0 
0 

b 

0 
0 



0 
0 
0 
0 
0 
0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
0 
0 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
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alytic activity, as a uracil DNA glycosylase 
(140) and functions as a cell cycle regulator
(141) and has even been implicated in
apoptosis (142).

Translation. Another striking set of human
expansions has occurred in certain families
involved in the translational machinery.
We identified 28 different ribosomal subunits
that each have at least 10 copies in the
genome; on average, for all ribosomal proteins
there is about an 8- to 10-fold expansion in
the number of genes relative to either the
worm or fly. Retrotransposed pseudogenes
may account for many of these expansions
[see the discussion above and (143)]. Recent
evidence suggests that a number of ribosomal
proteins have secondary functions independent
of their involvement in protein biosynthesis;
for example, L13a and the related L7
subunits (36 copies in humans) have been
shown to induce apoptosis (144).

There is also a four- to fivefold expansion 
in the elongation factor 1-alpha family
(eEF1A; 56 human genes). Many of these
expansions likely represent intronless para- 
logs that have presumably arisen from retro- 



Table 19 (Continued) 



Panther family/subfamily* 



H 



W 



MHC class I 

MHC class II 

Other immunoglobulin†

Toll receptor-related



22 
20 
114 
10 



0 
0 
0 
6 



Signaling molecules†
Calcitonin
Ephrin
FGF
Glucagon
Glycoprotein hormone beta chain
Insulin
Insulin-like hormone
Nerve growth factor
Neuregulin/heregulin
Neuropeptide Y
PDGF
Relaxin
Stanniocalcin
Thymopoietin
Thymosin beta
TGF-β
VEGF
Wnt
Receptors†
Ephrin receptor
FGF receptor
Frizzled receptor
Parathyroid hormone receptor
VEGF receptor
BDNF/NT-3 nerve growth factor receptor
Dual-specificity protein phosphatase
S/T and dual-specificity protein kinase†
S/T protein phosphatase
Y protein kinase†
Y protein phosphatase
Cyclic nucleotide phosphodiesterase
G-protein modulators†
Neurofibromin
Ras GTPase-activating
Tuberin
Vav proto-oncogene family
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transposition, and again there is evidence that 
many of these may be pseudogenes (145).
However, a second form (eEF1A2) of this
factor has been identified with tissue-specific
expression in skeletal muscle and a
complementary expression pattern to the ubiquitously
expressed eEF1A (146).

Ribonucleoproteins. Alternative splicing
results in multiple transcripts from a single
gene, and can therefore generate additional
diversity in an organism's protein
complement. We have identified 269 genes for
ribonucleoproteins. This represents over 2.5
times the number of ribonucleoprotein genes
in the worm, two times that of the fly, and
about the same as the 265 identified in the
Arabidopsis genome. Whether the diversity
of ribonucleoprotein genes in humans
contributes to gene regulation at either the
splicing or translational level is unknown.

Posttranslational modifications. In this
set of processes, the most prominent
expansion is the transglutaminases,
calcium-dependent enzymes that catalyze the cross-linking
of proteins in cellular processes such as
hemostasis and apoptosis (147). The vitamin
K-dependent gamma carboxylase gene
product acts on the GLA domain (missing in the
fly and worm) found in coagulation factors,
osteocalcin, and matrix GLA protein (148).
Tyrosylprotein sulfotransferases participate
in the posttranslational modification of
proteins involved in inflammation and
hemostasis, including coagulation factors and
chemokine receptors (149). Although there is no
significant numerical increase in the counts
for domains involved in nuclear protein
modification, there are a number of domain
arrangements in the predicted human proteins
that are not found in the other currently
sequenced genomes. These include the tandem
association of two histone deacetylase
domains in HD6 with a ubiquitin finger domain,
a feature lacking in the fly genome. An
additional example is the co-occurrence of the
important nuclear regulatory enzyme PARP
(poly-ADP ribosyl transferase) domain fused
to protein-interaction domains, BRCT and
VWA, in humans.

Concluding remarks. There are several
possible explanations for the differences in
phenotypic complexity observed in humans
when compared to the fly and worm. Some of
these relate to the prominent differences in
the immune system, hemostasis, neuronal,
vascular, and cytoskeletal complexity. The
finding that the human genome contains
fewer genes than previously predicted might be
compensated for by combinatorial diversity
generated at the levels of protein architecture,
transcriptional and translational control,
posttranslational modification of proteins, or
posttranscriptional regulation. Extensive
domain shuffling to increase or alter
combinatorial diversity can provide an exponential
increase in the ability to mediate
protein-protein interactions without dramatically
increasing the absolute size of the protein
complement (150). Evolution of apparently new
(from the perspective of sequence analysis)
protein domains and increasing regulatory
complexity by domain accretion both
quantitatively and qualitatively (recruitment of
novel domains with preexisting ones) are two
features that we observe in humans. Perhaps
the best illustration of this trend is the C2H2
zinc finger-containing transcription factors
where we see expansion in the number of
domains per protein, together with
vertebrate-specific domains such as KRAB and
SCAN. Recent reports on the prominent use
of internal ribosomal entry sites in the human
genome to regulate translation of specific
classes of proteins suggest that this is an area
that needs further research to identify the full
extent of this process in the human genome
(151). At the posttranslational level, although
we provide examples of expansions of some
protein families involved in these
modifications, further experimental evidence is
required to evaluate whether this is correlated
with increased complexity in protein
processing. Posttranscriptional processing and the
extent of isoform generation in the human
remain to be cataloged in their entirety. Given
the conserved nature of the spliceosomal
machinery, further analysis will be required to
dissect regulation at this level.
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Panther family/subfamily* 



H 



W 



C2H2 zinc finger-containing†
CREB
ETS-related
Forkhead-related
FOS
Groucho
Histone H1
Histone H2A
Histone H2B
Histone H3
Histone H4
Homeotic†
ABD-B
Bithoraxoid
Iroquois class
Distal-less
Engrailed
LIM-containing
MEIS/KNOX class
NK-3/NK-2 class
Paired box
Six
Leucine zipper
Nuclear hormone receptor†
Pou-related
Runt-related

Transcription factors/chromatin organization



Conclusions 



8.1 The whole-genome sequencing 
approach versus BAC by BAC 

Experience in applying the whole-genome
shotgun sequencing approach to a diverse
group of organisms with a wide range of
genome sizes and repeat content allows us to
assess its strengths and weaknesses. With the
success of the method for a large number of
microbial genomes, Drosophila, and now the
human, there can be no doubt concerning the
utility of this method. The large number of
microbial genomes that have been sequenced
by this method (15, 80, 152) demonstrate that
megabase-sized genomes can be sequenced
efficiently without any input other than the de
novo mate-paired sequences. With more
complex genomes like those of Drosophila or
human, map information, in the form of
well-ordered markers, has been critical for
long-range ordering of scaffolds. For joining
scaffolds into chromosomes, the quality of the
map (in terms of the order of the markers) is
more important than the number of markers
per se. Although this mapping could have
been performed concurrently with sequencing,
the prior existence of mapping data was
beneficial. During the sequencing of the
A. thaliana genome, sequencing of individual
BAC clones permitted extension of the
sequence well into centromeric regions and
allowed high-quality resolution of complex
repeat regions. Likewise, in Drosophila, the
BAC physical map was most useful in
regions near the highly repetitive centromeres
and telomeres. WGA has been found to
deliver excellent-quality reconstructions of the
unique regions of the genome. As the genome
size, and more importantly the repetitive
content, increases, the WGA approach delivers
less of the repetitive sequence.

The cost and overall efficiency of clone-by-clone
approaches makes them difficult to justify
as a stand-alone strategy for future large-scale
genome-sequencing projects. Specific
applications of BAC-based or other clone mapping and
sequencing strategies to resolve ambiguities in
sequence assembly that cannot be efficiently
resolved with computational approaches alone
are clearly worth exploring. Hybrid approaches
to whole-genome sequencing will only work if
there is sufficient coverage in both the
whole-genome shotgun phase and the BAC clone
sequencing phase. Our experience with human
genome assembly suggests that this will require
at least 3× coverage of both whole-genome and
BAC shotgun sequence data.
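
Coverage here means the total number of sequenced bases divided by the genome size. The following is a minimal sketch in Python of that arithmetic, together with the standard Lander-Waterman (Poisson) approximation for the fraction of bases that no read samples; the function names are hypothetical, and the genome size and read length in the demonstration are placeholder values for illustration, not figures from this section.

```python
import math


def reads_for_coverage(genome_size_bp, read_length_bp, coverage):
    """Smallest number of reads with reads * read_length / genome_size >= coverage."""
    if genome_size_bp <= 0 or read_length_bp <= 0 or coverage < 0:
        raise ValueError("sizes must be positive and coverage non-negative")
    return math.ceil(coverage * genome_size_bp / read_length_bp)


def expected_unsampled_fraction(coverage):
    """Poisson approximation: chance that a given base is covered by no read."""
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    return math.exp(-coverage)


if __name__ == "__main__":
    # Placeholder values chosen only to illustrate the calculation.
    genome_size_bp = 1_000_000
    read_length_bp = 500
    for coverage in (1, 3, 5):
        n_reads = reads_for_coverage(genome_size_bp, read_length_bp, coverage)
        missed = expected_unsampled_fraction(coverage)
        print(f"{coverage}x: {n_reads} reads, ~{missed:.1%} of bases unsampled")
```

Under this idealized model, which ignores repeats and cloning bias, 3× coverage in a single data set leaves roughly 5% of bases unsampled.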

8.2 The low gene number in humans 

We have sequenced and assembled -^95% of 
the euchrpmatic sequence of K sapiens and 
used a new automated gene prediction meth- 
od to produce a preliminary catalog of the 
human genes. This has provided a major sur- 
prise: We have found far fewer genes (26,000 
to 38,000) than the earlier molecular pre- 
dictions (50,000 to over 140,000). Whatever 
the reasons for this current disparity, only 
detailed aimotation, comparative genomics 
(particularly using the Mus musculus ge- 
nome), and careful molecular dissection of 
complex "phenotypes will clarify this critical 
issue of the basic "parts list" of our genome. 
Certainly, the analysis is still incomplete and 
considerable refinement will occur in the 
years to come as the precise structure of each 
transcription unit is evaluated. A good place 
to start is to determine why the gene esti- 
mates derived from EST data are so discor- 
dant with our predictions. It is likely that the 
following contribute to an inflated gene num-
ber derived from ESTs: the variable lengths
of 3'- and 5'-untranslated leaders and trailers;
the little-understood vagaries of RNA pro-
cessing that often leave intronic regions in an
unspliced condition; the finding that nearly
40% of human genes are alternatively spliced 
(153); and finally, the unsolved technical 
problems in EST library construction where 
contamination from heterogeneous nuclear 
RNA and genomic DNA are not uncommon.
Of course, it is possible that there are genes
that remain unpredicted owing to the absence
of EST or protein data to support them, al- 
though our use of mouse genome data for 



predicting genes should limit this number. As 
was true at the beginning of genome sequenc- 
ing, ultimately it will be necessary to measure 
mRNA in specific cell types to demonstrate 
the presence of a gene. 

J. B. S. Haldane speculated in 1937 that a
population of organisms might have to pay a
price for the number of genes it can possibly
carry. He theorized that when the number of
genes becomes too large, each zygote carries
so many new deleterious mutations that the
population simply cannot maintain itself. On
the basis of this premise, and on the basis of
available mutation rates and x-ray-induced
mutations at specific loci, Muller, in 1967
(154), calculated that the mammalian ge-
nome would contain a maximum of not much
more than 30,000 genes (155). An estimate of
30,000 gene loci for humans was also arrived
at by Crow and Kimura (156). Muller's esti-
mate for D. melanogaster was 10,000 genes,
compared to 13,000 derived by annotation of
the fly genome (26, 27). These arguments for
the theoretical maximum gene number were
based on simplified ideas of genetic load:
that all genes have a certain low rate of
mutation to a deleterious state. However, it is
clear that many mouse, fly, worm, and yeast
knockout mutations lead to almost no dis-
cernible phenotypic perturbations.

The modest number of human genes 
means that we must look elsewhere for the 
mechanisms that generate the complexities
inherent in human development and the so-
phisticated signaling systems that maintain
homeostasis. There are a large number of
ways in which the functions of individual
genes and gene products are regulated. The 
degree of "openness" of chromatin structure 
and hence transcriptional activity is regulated 
by protein complexes that involve histone 
and DNA enzymatic modifications. We enu- 
merate many of the proteins that are likely 
involved in nuclear regulation in Table 19. 
The location, timing, and quantity of tran- 
scription are intimately linked to nuclear sig- 
nal transduction events as well as by the 
tissue-specific expression of many of these 
proteins. Equally important are regulatory 
DNA elements that include insulators, re-
peats, and endogenous viruses (157); meth-
ylation of CpG islands in imprinting (158);
and promoter-enhancer and intronic regions
that modulate transcription. The spliceosomal
machinery consists of multisubunit proteins
(Table 19) as well as structural and catalytic
RNA elements (159) that regulate transcript
structure through alternative start and termi-
nation sites and splicing. Hence, there is a
need to study different classes of RNA mol-
ecules (160) such as small nucleolar RNAs,
antisense riboregulator RNA, RNA involved 
in X-dosage compensation, and other struc- 
tural RNAs to appreciate their precise role in 
regulating gene expression. The phenomenon 



of RNA editing in which coding changes
occur directly at the level of mRNA is of
clinical and biological relevance (161). Final-
ly, examples of translational control include
internal ribosomal entry sites that are found
in proteins involved in cell cycle regulation
and apoptosis (162). At the protein level,
minor alterations in the nature of protein-
protein interactions, protein modifications,
and localization can have dramatic effects on
cellular physiology (163). This dynamic sys-
tem therefore has many ways to modulate
activity, which suggests that definition of 
complex systems by analysis of single genes 
is unlikely to be entirely successful. 

In situ studies have shown that the human
genome is asymmetrically populated with
G+C content, CpG islands, and genes (68).
However, the genes are not distributed quite
as unequally as had been predicted (Table 9)
(69). The most G+C-rich fraction of the ge-
nome, H3 isochores, constitute more of the
genome than previously thought (about 9%),
and are the most gene-dense fraction, but
contain only 25% of the genes, rather than the
predicted ~40%. The low G+C L isochores
make up 65% of the genome, and 48% of the
genes. This inhomogeneity, the net result of
millions of years of mammalian gene dupli-
cation, has been described as the "desertifi-
cation" of the vertebrate genome (77). Why
are there clustered regions of high and low
gene density, and are these accidents of his-
tory or driven by selection and evolution? If
these deserts are dispensable, it ought to be
possible to find mammalian genomes that are
far smaller in size than the human genome.
Indeed, many species of bats have genome
sizes that are much smaller than that of hu-
mans; for example, Miniopterus, a species of
Italian bat, has a genome size that is only
50% that of humans (164). Similarly, Mun-
tiacus, a species of Asian barking deer, has a
genome size that is ~70% that of humans.
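
As a rough worked example of the isochore figures above, relative gene density can be expressed as the share of genes divided by the share of the genome. The sketch below uses only the percentages quoted in this paragraph; the function name and the lumping of all remaining isochores into a single "other" class are illustrative assumptions, not part of the original analysis.

```python
# Relative gene density = (fraction of genes) / (fraction of genome).
# A value of 1.0 would mean genes are spread evenly across the genome.

def relative_density(gene_fraction: float, genome_fraction: float) -> float:
    """Return the gene share divided by the genome share."""
    return gene_fraction / genome_fraction

# Percentages quoted in the text above.
ISOCHORES = {
    "H3": {"genome": 0.09, "genes": 0.25},
    "L": {"genome": 0.65, "genes": 0.48},
}

# Everything outside H3 and L, lumped together (assumption for illustration).
other_genome = 1.0 - sum(v["genome"] for v in ISOCHORES.values())
other_genes = 1.0 - sum(v["genes"] for v in ISOCHORES.values())

densities = {
    name: relative_density(v["genes"], v["genome"])
    for name, v in ISOCHORES.items()
}
densities["other"] = relative_density(other_genes, other_genome)

for name, d in densities.items():
    print(f"{name}: {d:.2f}x the genome-wide average")

print(f"H3 vs L: {densities['H3'] / densities['L']:.1f}-fold")
```

On these numbers the H3 fraction is gene-dense and the L fraction gene-poor, yet the contrast is smaller than it would have been under the earlier prediction of about 40% of genes in H3.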

8.3 Human DNA sequence variation 
and its distribution across the genome 

This is the first eukaryotic genome in which a
nearly uniform ascertainment of polymorphism
has been completed. Although we have identi-
fied and mapped more than 3 million SNPs, this
by no means implies that the task of finding and
cataloging SNPs is complete. These represent
only a fraction of the SNPs present in the
human population as a whole. Nevertheless,
this first glimpse at genome-wide variation has
revealed strong inhomogeneities in the distribu-
tion of SNPs across the genome. Polymorphism
in DNA carries with it a snapshot of the past
operation of population genetic forces, includ-
ing mutation, migration, selection, and genetic
drift. The availability of a dense array of SNPs
will allow questions related to each of these
factors to be addressed on a genome-wide basis.
SNP studies can establish the range of haplo-
types present in subjects of different ethnogeo-
graphic origins, providing insights into popula-
tion history and migration patterns. Although
such studies have suggested that modern human
lineages derive from Africa, many important
questions regarding human origins remain un-
answered, and more analyses using detailed
SNP maps will be needed to settle these con-
troversies. In addition to providing evidence for
population expansions, migration, and admix-
ture, SNPs can serve as markers for the extent
of evolutionary constraint acting on particular
genes. The correlation between patterns of in-
traspecies and interspecies genetic variation
may prove to be especially informative to iden-
tify sites of reduced genetic diversity that may
mark loci where sequence variations are not
tolerated.

The remarkable heterogeneity in SNP
density implies that there are a variety of
forces acting on polymorphism: sparse re-
gions may have lower SNP density because
the mutation rate is lower, because most of
those regions have a lower fraction of muta-
tions that are tolerated, or because recent
strong selection in favor of a newly arisen
allele "swept" the linked variation out of the
population (165). The effect of random ge-
netic drift also varies widely across the ge-
nome. The nonrecombining portion of the Y
chromosome faces the strongest pressure
from random drift because there are roughly
one-quarter as many Y chromosomes in the
population as there are autosomal chromo-
somes, and the level of polymorphism on the
Y is correspondingly less. Similarly, the X
chromosome has a smaller effective popu-
lation size than the autosomes, and its nu-
cleotide diversity is also reduced. But even
across a single autosome, the effective pop-
ulation size can vary because the density of
deleterious mutations may vary. Regions of
high density of deleterious mutations will
see a greater rate of elimination by selec-
tion, and the effective population size will
be smaller (166). As a result, the density of
even completely neutral SNPs will be lower
in such regions. There is a large literature
on the association between SNP density
and local recombination rates in Drosoph-
ila, and it remains an important task to
assess the strength of this association in the
human genome, because of its impact on
the design of local SNP densities for dis-
ease-association studies. It also remains an
important task to validate SNPs on a
genomic scale in order to assess the degree
of heterogeneity among geographic and
ethnic populations.
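
The "one-quarter" figure follows from counting chromosome copies in a population with equal numbers of males and females. A minimal sketch of that arithmetic, assuming an idealized population: the function name and the 1:1 sex ratio are illustrative assumptions, and real effective population sizes also depend on other factors such as variance in reproductive success.

```python
from fractions import Fraction

def chromosome_copies(males: int, females: int) -> dict:
    """Count copies of each chromosome type carried by a population."""
    return {
        "autosome": 2 * (males + females),  # two copies in every individual
        "X": males + 2 * females,           # one in males, two in females
        "Y": males,                         # one in males only
    }

copies = chromosome_copies(males=5000, females=5000)

# Express each count relative to the number of copies of any one autosome.
relative = {k: Fraction(v, copies["autosome"]) for k, v in copies.items()}

for name, ratio in relative.items():
    print(f"{name}: {ratio} of the autosomal copy number")
```

Under these assumptions the Y chromosome is present at one-quarter, and the X chromosome at three-quarters, of the autosomal copy number, which matches the ordering of diversity described above.
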

8.4 Genome complexity

We will soon be in a position to move away
from the cataloging of individual compo-
nents of the system, and beyond the sim-
plistic notions of "this binds to that, which
then docks on this, and then the complex
moves there" (167) to the exciting area
of network perturbations, nonlinear re-
sponses and thresholds, and their pivotal
role in human diseases.

The enumeration of other "parts lists" re-
veals that in organisms with complex nervous
systems, neither gene number, neuron number,
nor number of cell types correlates in any
meaningful manner with even simplistic mea-
sures of structural or behavioral complexity.
Nor would they be expected to; this is the realm
of nonlinearities and epigenesis (168). The 520
million neurons of the common octopus exceed
the neuronal number in the brain of a mouse by
an order of magnitude. It is apparent from a
comparison of genomic data on the mouse and
human, and from comparative mammalian neu-
roanatomy (169), that the morphological and
behavioral diversity found in mammals is un-
derpinned by a similar gene repertoire and sim-
ilar neuroanatomies. For example, when one
compares a pygmy marmoset (which is only 4
inches tall and weighs about 6 ounces) to a
chimpanzee, the brain volume of this minute
primate is found to be only about 1.5 cm³, two
orders of magnitude less than that of a chimp
and three orders less than that of humans. Yet
the neuroanatomies of all three brains are strik-
ingly similar, and the behavioral characteristics
of the pygmy marmoset are little different from
those of chimpanzees. Between humans and
chimpanzees, the gene number, gene structures
and functions, chromosomal and genomic or-
ganizations, and cell types and neuroanatomies
are almost indistinguishable, yet the develop-
mental modifications that predisposed human
lineages to cortical expansion and development
of the larynx, giving rise to language, culminat-
ed in a massive singularity that by even the
simplest of criteria made humans more com-
plex in a behavioral sense.

Simple examination of the number of neu-
rons, cell types, or genes or of the genome
size does not alone account for the differenc-
es in complexity that we observe. Rather, it is
the interactions within and among these sets
that result in such great variation. In addition,
it is possible that there are "special cases" of
regulatory gene networks that have a dispro-
portionate effect on the overall system. We
have presented several examples of "regula-
tory genes" that are significantly increased in
the human genome compared with the fly and
worm. These include extracellular ligands
and their cognate receptors (e.g., wnt, friz-
zled, TGF-β, ephrins, and connexins), as well
as nuclear regulators (e.g., the KRAB and
homeodomain transcription factor families),
where a few proteins control broad develop-
mental processes. The answers to these
"complexities" perhaps lie in these expanded
gene families and differences in the regulato-
ry control of ancient genes, proteins, path-
ways, and cells.



8.5 Beyond single components 

While few would disagree with the intuitive
conclusion that Einstein's brain was more
complex than that of Drosophila, closer com-
parisons such as whether the set of predicted
human proteins is more complex than the
protein set of Drosophila, and if so, to what
degree, are not straightforward, since protein,
protein domain, or protein-protein interaction
measures do not capture context-dependent
interactions that underpin the dynamics un-
derlying phenotype.

Currently, there are more than 30 different
mathematical descriptions of complexity (170).
However, we have yet to understand the math-
ematical dependency relating the number of
genes with organism complexity. One pragmat-
ic approach to the analysis of biological sys-
tems, which are composed of nonidentical ele-
ments (proteins, protein complexes, interacting
cell types, and interacting neuronal popula-
tions), is through graph theory (171). The ele-
ments of the system can be represented by the
vertices of complex topographies, with the edg-
es representing the interactions between them.
Examination of large networks reveals that they
can self-organize, but more important, they can
be particularly robust. This robustness is not
due to redundancy, but is a property of inho-
mogeneously wired networks. The error toler-
ance of such networks comes with a price; they
are vulnerable to the selection or removal of a
few nodes that contribute disproportionately to
network stability. Gene knockouts provide an
illustration. Some knockouts may have minor
effects, whereas others have catastrophic effects
on the system. In the case of vimentin, a sup-
posedly critical component of the cytoplasmic
intermediate filament network of mammals, the
knockout of the gene in mice reveals them to be
reproductively normal, with no obvious pheno-
typic effects (172), and yet the usually conspic-
uous vimentin network is completely absent.
On the other hand, ~30% of knockouts in
Drosophila and mice correspond to critical
nodes whose reduction in gene product, or total
elimination, causes the network to crash most
of the time, although even in some of these
cases, phenotypic normalcy ensues, given the
appropriate genetic background. Thus, there are
no "good" genes or "bad" genes, but only net-
works that exist at various levels and at differ-
ent connectivities, and at different states of
sensitivity to perturbation. Sophisticated math-
ematical analysis needs to be constantly evalu-
ated against hard biological data sets that spe-
cifically address network dynamics. Nowhere is
this more critical than in attempts to come to
grips with "complexity," particularly because
deconvoluting and correcting complex net-
works that have undergone perturbation, and
have resulted in human diseases, is the greatest
significant challenge now facing us.
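
The trade-off between error tolerance and vulnerability described above can be illustrated with a deliberately small toy network. The sketch below is only an illustration under stated assumptions: the hub-and-leaf topology, the node names, and both helper functions are hypothetical and do not model any real gene or protein network.

```python
from collections import deque

def build_hub_network(n_hubs: int, leaves_per_hub: int) -> dict:
    """Build a toy inhomogeneous network: hubs in a chain, each with many leaves."""
    graph = {}
    hubs = [f"hub{i}" for i in range(n_hubs)]
    for h in hubs:
        graph[h] = set()
    # Link consecutive hubs so the intact network is one connected piece.
    for a, b in zip(hubs, hubs[1:]):
        graph[a].add(b)
        graph[b].add(a)
    # Attach sparsely connected leaf nodes to each hub.
    for h in hubs:
        for j in range(leaves_per_hub):
            leaf = f"{h}_leaf{j}"
            graph[leaf] = {h}
            graph[h].add(leaf)
    return graph

def largest_component(graph: dict, removed: set) -> int:
    """Size of the largest connected component after deleting `removed` nodes."""
    seen, best = set(removed), 0
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search over surviving nodes
            node = queue.popleft()
            size += 1
            for nb in graph[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        best = max(best, size)
    return best

net = build_hub_network(n_hubs=3, leaves_per_hub=10)
hubs = {"hub0", "hub1", "hub2"}
some_leaves = {"hub0_leaf0", "hub1_leaf0", "hub2_leaf0"}

intact = largest_component(net, set())
after_leaf_loss = largest_component(net, some_leaves)
after_hub_loss = largest_component(net, hubs)
print(intact, after_leaf_loss, after_hub_loss)
```

Removing three weakly connected nodes leaves the network almost intact, whereas removing the same number of highly connected nodes fragments it completely, mirroring the contrast between knockouts with minor effects and knockouts of critical nodes.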

It has been predicted for the last 15 years
that complete sequencing of the human ge-
nome would open up new strategies for hu-
man biological research and would have a
major impact on medicine, and through med-
icine and public health, on society. Effects on
biomedical research are already being felt.
This assembly of the human genome se-
quence is but a first, hesitant step on a long
and exciting journey toward understanding
the role of the genome in human biology. It
has been possible only because of innova-
tions in instrumentation and software that
have allowed automation of almost every step
of the process from DNA preparation to an-
notation. The next steps are clear: We must
define the complexity that ensues when this
relatively modest set of about 30,000 genes is
expressed. The sequence provides the frame-
work upon which all the genetics, biochem-
istry, physiology, and ultimately phenotype
depend. It provides the boundaries for scien-
tific inquiry. The sequence is only the first
level of understanding of the genome. All
genes and their control elements must be
identified; their functions, in concert as well
as in isolation, defined; their sequence varia-
tion worldwide described; and the relation
between genome variation and specific phe-
notypic characteristics determined. Now we
know what we have to explain.

Another paramount challenge awaits:
public discussion of this information and its
potential for improvement of personal health.
Many diverse sources of data have shown
that any two individuals are more than 99.9%
identical in sequence, which means that all
the glorious differences among individuals in
our species that can be attributed to genes
fall in a mere 0.1% of the sequence. There
are two fallacies to be avoided: determinism,
the idea that all characteristics of the person
are "hard-wired" by the genome; and reduc-
tionism, the view that with complete knowl-
edge of the human genome sequence, it is
only a matter of time before our understand- 
ing of gene functions and interactions will 
provide a complete causal description of hu- 
man variability. The real challenge of human 
biology, beyond the task of finding out how 
genes orchestrate the construction and main- 
tenance of the miraculous mechanism of our 
bodies, will lie ahead as we seek to explain 
how our minds have come to organize 
thoughts sufficiently well to investigate our 
own existence. 
[...] DNA. The internally deleted libraries, when plated
[...] µg/ml), carbeni-
cillin (50 µg/ml), and kanamycin (15 µg/ml), pro-
[...] plates prepared with a fresh top layer containing no
antibiotic poured on top of a previously set bottom
layer containing excess antibiotic, to achieve the
[...]ing robots were used to pick colonies meeting strin-
[...]well microtiter plates containing liquid growth me-
[...] passing to template preparation. Template DNA was
extracted from liquid bacterial culture using a pro-
[...]od (173) adapted for high throughput processing in
384-well microtiter plates. Bacterial cells were
[...] centrifugation;
and plasmid DNA was recovered by isopropanol
precipitation and resuspended in 10 mM tris-HCl
buffer. Reagent dispensing operations were accom-
plished using Titertek MAP 8 liquid dispensing sys-
tems. Plate-to-plate liquid transfers were performed
using Tomtec Quadra 384 Model 320 pipetting ro-
[...] (Applied Biosystems) and standard M13 forward
and reverse primers. Sequencing reactions were pre-
pared using the Tomtec Quadra 384-320 pipetting
robot. Parent-child plate relationships and, by ex-
[...] plate pairs were
established by automated plate barcode reading by
the onboard barcode reader and were recorded by
direct LIMS communication. Sequencing reaction
products were purified by alcohol precipitation and
were dried, sealed, and stored at 4°C in the dark
until needed for sequencing, at which time the
reaction products were resuspended in deionized
formamide and sealed immediately to prevent deg-
radation. All sequence data were generated using a
single sequencing platform, the ABI PRISM 3700
DNA Analyzer. Sample sheets were created at load
time using a Java-based application that facilitates
barcode scanning of the sequencing plate barcode,
retrieves sample information from the central LIMS,
and reserves unique trace identifiers. The applica-
tion permitted a single sample sheet file in the
linking directory and deleted previously created
sample sheet files immediately upon scanning of a
[...] one another. Next, the resulting BLAST reports are
parsed, and a graph is created wherein each protein
constitutes a node; any hit between two proteins
with an expectation beneath a user-specified
threshold constitutes an edge. Lek then uses this
graph to compute a similarity between each protein
pair i,j in the context of the graph as a whole by
simply dividing the number of BLAST hits shared in
common between the two proteins by the total
number of proteins hit by i and j. This simple metric
has several interesting properties. First, because the
[...] multidomain nature of protein space. Two multido-
main proteins, for instance, each containing do-
mains A and B, will have a greater pairwise similarity
to each other than either one will have to a protein
containing only A or B domains, so long as A-B-
containing multidomain proteins are less frequent in
the proteome than are single-domain proteins con-
taining A or B domains. A second interesting prop-
erty of this similarity metric is that it can be used to
produce a similarity matrix for the proteome as a
whole without having to first produce a multiple
alignment for each protein family, an error-prone
and very time-consuming process. Finally, the met-
ric does not require that either sequence have sig-
nificant homology to the other in order to have a
defined similarity to each other, only that they
share at least one significant BLAST hit in common.
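
The similarity metric described above can be sketched in a few lines. This is not the Lek implementation: the hit lists are invented toy data standing in for parsed BLAST reports, each protein is assumed to hit itself, "the total number of proteins hit by i and j" is read here as the union of the two hit sets, and the function name is hypothetical.

```python
def shared_hit_similarity(hits: dict, i: str, j: str) -> float:
    """Shared BLAST hits of i and j divided by all proteins hit by i or j."""
    union = hits[i] | hits[j]
    if not union:
        return 0.0
    return len(hits[i] & hits[j]) / len(union)

# Toy proteome: two A-B multidomain proteins, three A-only, three B-only.
domains = {
    "AB1": {"A", "B"}, "AB2": {"A", "B"},
    "A1": {"A"}, "A2": {"A"}, "A3": {"A"},
    "B1": {"B"}, "B2": {"B"}, "B3": {"B"},
}

# Assume two proteins produce a significant hit whenever they share a domain.
hits = {
    p: {q for q, dq in domains.items() if dp & dq}
    for p, dp in domains.items()
}

sim_ab_ab = shared_hit_similarity(hits, "AB1", "AB2")  # two multidomain proteins
sim_ab_a = shared_hit_similarity(hits, "AB1", "A1")    # multidomain vs A-only
sim_a_b = shared_hit_similarity(hits, "A1", "B1")      # no direct hit between them
print(sim_ab_ab, sim_ab_a, sim_a_b)
```

In this toy set the two A-B proteins score higher with each other than either does with a single-domain protein, and the A-only and B-only proteins still receive a nonzero similarity through their shared hits to the A-B proteins even though they do not hit each other.
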
This is an especially interesting property of the
metric, because it allows the rapid recovery of pro-
tein families from the proteome for which no mul-
tiple alignment is possible, thus providing a compu-
tational basis for the extension of protein homology
searches beyond those of current HMM- and profile-
based search methods. Once the whole-proteome
similarity matrix has been calculated, Lek first par-
titions the proteome into single-linkage clusters
(27) on the basis of one or more shared BLAST hits
between two sequences. Next, these single-linkage
clusters are further partitioned into subclusters,
each member of which shares a user-specified pair-
wise similarity with the other members of the clus-
ter, as described above. For the purposes of this
publication, we have focused on the analysis of
single-linkage clusters and what we have termed
"complete clusters," e.g., those subclusters for
which every member has a similarity metric of 1 to
every other member of the subcluster. We believe
that the single-linkage and complete clusters are of
special interest, in part, because they allow us to
estimate and to compare sizes of core protein sets
in a rigorous manner. The rationale for this is as
follows: If one imagines for a moment a perfect
clustering algorithm capable of perfectly partition-
ing one or more perfectly annotated protein sets
into protein families, it is reasonable to assume that
the number of clusters will always be greater than,
or equal to, the number of single-linkage clusters,
because single-linkage clustering is a maximally ag-
glomerative clustering method. Thus, if there exists
a single protein in the predicted protein set contain-
ing domains A and B, then it will be clustered by
single linkage together with all single-domain pro-
teins containing domains A or B. Likewise, for a
predicted protein set containing a single multido-
main protein, the number of real clusters must
always be less than or equal to the number of
complete clusters, because it is impossible to place
a unique multidomain protein into a complete clus-
ter. Thus, the single-linkage and complete clusters
plus singletons should comprise a lower and upper
bound of sizes of core protein sets, respectively,
allowing us to compare the relative size and com-
plexity of different organisms' predicted protein set
[...] (for this analysis, N = 26,588). Allowing for B' to
occur as any of the next J - 1 proteins [leaving a gap
between A' and B'] increases the probability to (J -
1)/N; allowing B'A' or A'B' gives a probability of 2(J
- 1)/N. Considering three genes ABC, the probabil-
ity of observing A'B'C' elsewhere in the genome,
given that the paralogs exist, is 1/N^2. Three pro-
teins can occur across a spread of five positions in
six ways; more generally, we compute the number
of ways that K proteins can be spread across J
positions by counting all possible arrangements of K
- 2 proteins in the J - 2 positions between the first
and last protein. Allowing for a spread to vary from
K positions (no gaps) to J gives
[...] arrangements. Thus, the probability of chance occur-
rence is L/N^(K-1). Allowing for both sets of genes (e.g.,
ABC and A'B'C') to be spread across J positions
increases this to L^2/N^(K-1). The duplicated segment
might be rearranged by the operations of reversal or
translocation; allowing for M such rearrangements
gives us a probability P = L^2 M/N^(K-1). For example, the
probability of observing a duplicated set of three
[...] locations, is 36/N^2; the expected number of such
matched sets in the predicted protein set is approx-
imately (N)36/N^2 = 36/N, a value <<1. Therefore,
[...] any of the genes occur in more than two copies, the
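
The counting argument in this note can be checked numerically. The expression for the number of arrangements is not legible in this copy, so the sketch below assumes it is a sum of binomial coefficients over spreads from K to J positions; that assumption reproduces the "six ways" and 36/N^2 figures quoted in the note. The function names are hypothetical.

```python
from math import comb

def arrangements(k: int, j: int) -> int:
    """Ways to place k proteins across spreads of k..j positions.

    For a spread of s positions the first and last slots are occupied, so the
    remaining k - 2 proteins are chosen among the s - 2 interior slots.
    """
    return sum(comb(s - 2, k - 2) for s in range(k, j + 1))

def chance_probability(k: int, j: int, n: int, m: int = 1) -> float:
    """P = L^2 * M / N^(K-1), following the expression given in the note."""
    ways = arrangements(k, j)
    return ways * ways * m / n ** (k - 1)

N = 26_588  # size of the predicted protein set quoted in the note

L = arrangements(3, 5)            # three proteins across up to five positions
p = chance_probability(3, 5, N)   # both copies spread across five positions
expected = N * p                  # expected number of such matched sets

print(L, p, expected)
```

With K = 3 and J = 5 the assumed count gives L = 6, so the probability is 36/N^2 and the expected number of chance matches is 36/N, far below one, in agreement with the values stated in the note.
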
A historic moment for the scientific endeavor.

THE HUMAN GENOME

Humanity has been given a great gift. With the completion of the human
genome sequence, we have received a powerful tool for unlocking the
secrets of our genetic heritage and for finding our place among the other
participants in the adventure of life.

This week's issue of Science contains the report of the sequencing of
the human genome from a group of authors led by Craig Venter of Celera
Genomics. The report of the sequencing of the human genome from the
publicly funded consortium of laboratories led by Francis Collins appears
in this week's Nature. This stunning achievement has been portrayed,
often unfairly, as a competition between two
ventures, one public and one private. That characterization detracts from
the awesome accomplishment jointly unveiled this week. In truth, each
project contributed to the other. The inspired vision that launched the
publicly funded project roughly 10 years ago reflected, and now rewards,
the confidence of those who believe that the pursuit of large-scale funda-
mental problems in the life sciences is in the national interest. The technical
innovation and drive of Craig Venter and his colleagues made it possible
to celebrate this accomplishment far sooner than was believed possible.
Thus we can salute what has become, in the end, not a contest but a
marriage (perhaps encouraged by shotgun) between public funding and
private entrepreneurship.
There are excellent scientific reasons for applauding an outcome that
has given us two winners. Two sequences are better than one; the opportunity for
[...]vergence is invaluable. Indeed, a real-world proof of the importan[...]
be found in the pages of this issue of Science, in the comparative analysis by Olivier et al. (p. 1298).

Although we have made the point before, it is worth repeating that the sequencing of the human
genome represents, not an ending, but the beginning of a new approach to biology. As [...] says in
his Viewpoint (p. 257), the knowledge that all of the genetic components of any process can be
[...] extraordinary new power to scientists. Because of this breakthrough, research
[...] of individual genes to a more integrated view that [...]
whole ensembles of genes as they interact to form a living human being. Several articles [...]
highlight how this approach is already beginning to revolutionize the way [...]

This has been a massive project, on a scale unparalleled in the history of biology. But of course,
it has built on the scientific insights of centuries of investigators. By coincidence, this landmark
announcement falls during the week of the anniversary of the birth of Charles [...]
message that the survival of a species can depend on its ability to evolve in the face of change is
peculiarly pertinent to discussions that have gone on in the past year over access to Celera data.
[...] information regarding the agreements that were reached to make [...]
found at www.sciencemag.org/feature/data/amiouncemen^gsp.sMO We [...]
allowing data repositories other than the traditional GenBank, while insisting on access to all the
[...] to verify conclusions. In this domain change is [...]
are producing more and more potentially valuable sequences, yet [...]
laws governing databases provide scant protection against piracy. Had the Celera [...]
secret, it would have been a serious loss to the scientific community. We hope that our adaptability in
[...] change will enable other proprietary data to be published [...]
satisfies our continuing commitment to full access.

It should be no surprise that an achievement so stunning, and so carefully [...] created
new challenges for the scientific venture. Science is proud to have played a role in bringing this
discovery onto the public stage. It is literally true that this is a historic moment for the scientific en-
deavor. The human genome has been called the Book of Life. Rather, it is a [...]
rules that encourage exploration and reward creativity, we can find many of the books that will
help define us and our place in the great tapestry of life.
Paracel BLAST Results

MEGABLAST 1.2.3-Paracel [2001-11-20]

Reference:
Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000),
"A greedy algorithm for aligning DNA sequences",
J Comput Biol 2000; 7(1-2):203-14.

Database: Homo_sapiens.latestgp.fa
           26,679 sequences; 200,800,637,119 total letters

Query= 1
         (2629 letters)

                                                     Score     E
Sequences producing significant alignments:         (bits)   Value

AC006208.3.1.123943                                    940    0.0
AC000063.1.1.34478                                      72    2e-09
AC079799.7.1.172495                                     54    5e-04

>AC006208.3.1.123943
          Length = 123943

 Score = 940 bits (474), Expect = 0.0
 Identities = 474/474 (100%)
 Strand = Plus / Minus



Query: 2156  caggtgaagacggacgagcgagtcttgcacacggagcgggggctgctgttccgcaggctt 2215
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 44516 caggtgaagacggacgagcgagtcttgcacacggagcgggggctgctgttccgcaggctt 44457

Query: 2216  agccgtttcgatgcgggcacctacacctgcaccactctggagcatggcttctcccagact 2275
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 44456 agccgtttcgatgcgggcacctacacctgcaccactctggagcatggcttctcccagact 44397

Query: 2276  gtggtccgcctggctctggtggtgattgtggcctcacagctggacaacctgttccctccg 2335
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 44396 gtggtccgcctggctctggtggtgattgtggcctcacagctggacaacctgttccctccg 44337

Query: 2336  gagccaaagccagaggagcccccagcccggggaggcctggcttccaccccacccaaggcc 2395
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 44336 gagccaaagccagaggagcccccagcccggggaggcctggcttccaccccacccaaggcc 44277

Query: 2396  tggtacaaggacatcctgcagctcattggcttcgccaacctgccccgggtggatgagtac 2455
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 44276 tggtacaaggacatcctgcagctcattggcttcgccaacctgccccgggtggatgagtac 44217

Query: 2456  tgtgagcgcgtgtggtgcaggggcaccacggaatgctcaggctgcttccggagccggagc 2515
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 44216 tgtgagcgcgtgtggtgcaggggcaccacggaatgctcaggctgcttccggagccggagc 44157



http://lexblast.lexgen.com^Iast_results.cgi?id=57997&refresh=30 
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Query: 2516 cggggcaagcaggccaggggcaagagctgggcagggctggagctaggcaagaagatgaag 2575 

IIIIIIIIIIIIIIIIIIIIIMIIIIIMIMIIIIIIIIIIIIIIMIIIIIIIIIIi 

Sbjct: 44156 cggggcaagcaggccaggggcaagagctgggcagggctggagctaggcaagaagatgaag 44097 
Query: -2576 agccgggtgcatgccgagcacaatcggacgccccgggaggtggaggccacgtag 2629 

iiiiiiiiiiiiiiiiiiiiiiiiiiriiiiiiiiiiiiiiiiiiiiiiiiiii 

Sbjct: 44096 agccgggtgcatgccgagcacaatcggacgccccgggaggtggaggccacgtag 44043 



Score = 781 bits (394), Expect = 0.0 
Identities = 394/394 (100%) 
Strand = Plus / Minus 

Query: 2     atggcctgtgccctagctgggaaggtcttcccaatggggagctggccagtgtggcacaaa 61
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 53746 atggcctgtgccctagctgggaaggtcttcccaatggggagctggccagtgtggcacaaa 53687

Query: 62    agcctgcactgggccaacaaggtggaaggagaagcggcaggtggacggcaaggccccagc 121
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 53686 agcctgcactgggccaacaaggtggaaggagaagcggcaggtggacggcaaggccccagc 53627

Query: 122   ctccttctctcctccgcccctcttcccgcccaggactgggtggagccactgccttataag 181
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 53626 ctccttctctcctccgcccctcttcccgcccaggactgggtggagccactgccttataag 53567

Query: 182   tggtggcctggtggcagcagagcaaactacaaccggcggccagcgggaccagagggcggc 241
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 53566 tggtggcctggtggcagcagagcaaactacaaccggcggccagcgggaccagagggcggc 53507

Query: 242   tctgcaggcaggcggcagcggtgccctcagttccccagcatggccccctcggcctgggcc 301
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 53506 tctgcaggcaggcggcagcggtgccctcagttccccagcatggccccctcggcctgggcc 53447

Query: 302   atttgctggctgctagggggcctcctgctccatgggggtagctctggccccagccccggc 361
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 53446 atttgctggctgctagggggcctcctgctccatgggggtagctctggccccagccccggc 53387

Query: 362   cccagtgtgccccgcctgcggctctcctaccgag 395
             ||||||||||||||||||||||||||||||||||
Sbjct: 53386 cccagtgtgccccgcctgcggctctcctaccgag 53353



Score = 462 bits (233), Expect = e-127 
Identities = 233/233 (100%) 
Strand = Plus / Minus 

Query: 1423  gtgccccagcaagatgaccgcacagccaggacggccttttggcagcaccaaggactaccc 1482
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 48539 gtgccccagcaagatgaccgcacagccaggacggccttttggcagcaccaaggactaccc 48480

http://lexblast.lexgen.com/blast_results.cgi?id=57997&refresh=30

9/19/2003

MEGABLAST Search Results Page 3 of 8

Query: 1483  agatgaggtgctgcagtttgcccgagcccaccccctcatgttctggcctgtgcggcctcg 1542
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 48479 agatgaggtgctgcagtttgcccgagcccaccccctcatgttctggcctgtgcggcctcg 48420

Query: 1543  acatggccgccctgtccttgtcaagacccacctggcccagcagctacaccagatcgtggt 1602
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 48419 acatggccgccctgtccttgtcaagacccacctggcccagcagctacaccagatcgtggt 48360

Query: 1603  ggaccgcgtggaggcagaggatgggacctacgatgtcattttcctggggactg 1655
             |||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 48359 ggaccgcgtggaggcagaggatgggacctacgatgtcattttcctggggactg 48307



Score = 456 bits (230), Expect = e-125
Identities = 230/230 (100%)
Strand = Plus / Minus

Query: 1789  gcaaatgctatacgtgggctctcggctgggtgtggcccagctgcggctgcaccaatgtga 1848
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 46640 gcaaatgctatacgtgggctctcggctgggtgtggcccagctgcggctgcaccaatgtga 46581

Query: 1849  gacttacggcactgcctgtgcagagtgctgcctggcccgggacccatactgtgcctggga 1908
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 46580 gacttacggcactgcctgtgcagagtgctgcctggcccgggacccatactgtgcctggga 46521

Query: 1909  tggtgcctcctgtacccactaccgccccagccttggcaagcgccggttccgccggcagga 1968
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 46520 tggtgcctcctgtacccactaccgccccagccttggcaagcgccggttccgccggcagga 46461

Query: 1969  catccggcacggcaaccctgccctgcagtgcctgggccagagccaggaag 2018
             ||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 46460 catccggcacggcaaccctgccctgcagtgcctgggccagagccaggaag 46411



Score = 327 bits (165), Expect = 2e-86
Identities = 165/165 (100%)
Strand = Plus / Minus

Query: 394   agacctcctgtctgccaaccgctctgccatctttctgggcccccagggctccctgaacct 453
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 51349 agacctcctgtctgccaaccgctctgccatctttctgggcccccagggctccctgaacct 51290

Query: 454   ccaggccatgtacctagatgagtaccgagaccgcctctttctgggtggcctggacgccct 513
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 51289 ccaggccatgtacctagatgagtaccgagaccgcctctttctgggtggcctggacgccct 51230

Query: 514   ctactctctgcggctggaccaggcatggccagatccccgggaggt 558
             |||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 51229 ctactctctgcggctggaccaggcatggccagatccccgggaggt 51185



Score = 294 bits (148), Expect = 3e-76
Identities = 148/148 (100%)
Strand = Plus / Minus

Query: 1276  cagtgccgtgttccagggcttcgccgtctgtgtgtaccacatggcagacatctgggaggt 1335
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 48964 cagtgccgtgttccagggcttcgccgtctgtgtgtaccacatggcagacatctgggaggt 48905

Query: 1336  tttcaacgggccctttgcccaccgagatgggcctcagcaccagtgggggccctatggggg 1395
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 48904 tttcaacgggccctttgcccaccgagatgggcctcagcaccagtgggggccctatggggg 48845

Query: 1396  caaggtgcccttccctcgccctggcgtg 1423
             ||||||||||||||||||||||||||||
Sbjct: 48844 caaggtgcccttccctcgccctggcgtg 48817



Score = 292 bits (147), Expect = 1e-75
Identities = 147/147 (100%)
Strand = Plus / Minus

Query: 947   gacccccggtttgtgatggccgcccggatccctgagaactctgaccaggacaatgacaag 1006
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 49850 gacccccggtttgtgatggccgcccggatccctgagaactctgaccaggacaatgacaag 49791

Query: 1007  gtgtacttcttcttctcggagacggtcccctcgcccgatggtggctcgaaccatgtcact 1066
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 49790 gtgtacttcttcttctcggagacggtcccctcgcccgatggtggctcgaaccatgtcact 49731

Query: 1067  gtcagccgcgtgggccgcgtctgcgtg 1093
             |||||||||||||||||||||||||||
Sbjct: 49730 gtcagccgcgtgggccgcgtctgcgtg 49704



Score = 286 bits (144), Expect = 6e-74 
Identities = 145/146 (99%) 
Strand = Plus / Minus 

Query: 2017  agaagaggcagtgggacttgtggcagccaccatggtctacggcacggagcacaatagcac 2076
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 46108 agaagaggcagtgggacttgtggcagccaccatggtctacggcacggagcacaatagcac 46049

Query: 2077  cttcctggagtgcctgcccaagtctccccargctgctgtgcgctggctcttgcagaggcc 2136
             |||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||
Sbjct: 46048 cttcctggagtgcctgcccaagtctccccaggctgctgtgcgctggctcttgcagaggcc 45989

Query: 2137  aggggatgaggggcctgaccaggtga 2162
             ||||||||||||||||||||||||||
Sbjct: 45988 aggggatgaggggcctgaccaggtga 45963



Score = 240 bits (121), Expect = 3e-60
Identities = 121/121 (100%)
Strand = Plus / Minus

Query: 619   gacagagtgcgccaacttcgtgcgggtgctacagcctcacaaccggacccacctgctagc 678
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 50745 gacagagtgcgccaacttcgtgcgggtgctacagcctcacaaccggacccacctgctagc 50686

Query: 679   ctgtggcactggggccttccagcccacctgtgccctcatcacagttggccaccgtgggga 738
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 50685 ctgtggcactggggccttccagcccacctgtgccctcatcacagttggccaccgtgggga 50626

Query: 739   g 739
             |
Sbjct: 50625 g 50625



Score = 236 bits (119), Expect = 5e-59
Identities = 119/119 (100%)
Strand = Plus / Minus

Query: 829   agacggggagctgtacacgggtctcactgctgacttcctggggcgagaggccatgatctt 888
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 50132 agacggggagctgtacacgggtctcactgctgacttcctggggcgagaggccatgatctt 50073

Query: 889   ccgaagtggaggtcctcggccagctctgcgttccgactctgaccagagtctcttgcacg 947
             |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 50072 ccgaagtggaggtcctcggccagctctgcgttccgactctgaccagagtctcttgcacg 50014



Score = 230 bits (116), Expect = 3e-57 
Identities = 116/116 (100%) 
Strand = Plus / Minus 

Query: 1093  gaatgatgctgggggccagcgggtgctggtgaacaaatggagcactttcctcaaggccag 1152
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 49489 gaatgatgctgggggccagcgggtgctggtgaacaaatggagcactttcctcaaggccag 49430

Query: 1153  gctggtctgctcggtgcccggccctggtggtgccgagacccactttgaccagctag 1208
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 49429 gctggtctgctcggtgcccggccctggtggtgccgagacccactttgaccagctag 49374

Score = 188 bits (95), Expect = 1e-44
Identities = 95/95 (100%)
Strand = Plus / Minus

Query: 1655  gactcagggtctgtgctcaaagtcatcgctctcca...
             |||||||||||||||||||||||||||||||||||

Query: 1715  gaagtggttctggaggagctccaggtgtttaaggt 1749
             |||||||||||||||||||||||||||||||||||
Sbjct: 48152 gaagtggttctggaggagctccaggtgtttaaggt 48118



Score = 184 bits (93), Expect = 2e-43 
Identities = 93/93 (100%) 
Strand = Plus / Minus 



Query: 738   agcatgtgctccacctggagcctggcagtgtggaaagtggccgggggcggtgccctcacg 797
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 50351 agcatgtgctccacctggagcctggcagtgtggaaagtggccgggggcggtgccctcacg 50292

Query: 798   agcccagccgtccctttgccagcaccttcatag 830
             |||||||||||||||||||||||||||||||||
Sbjct: 50291 agcccagccgtccctttgccagcaccttcatag 50259



Score = 143 bits (72), Expect = 6e-31 
Identities = 72/72 (100%) 
Strand = Plus / Minus 

Query: 1207  agaggatgtgttcctgctgtggcccaaggccgggaagagcctcgaggtgtacgcgctgtt 1266
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 49265 agaggatgtgttcctgctgtggcccaaggccgggaagagcctcgaggtgtacgcgctgtt 49206

Query: 1267  cagcaccgtcag 1278
             ||||||||||||
Sbjct: 49205 cagcaccgtcag 49194



Score = 129 bits (65), Expect = 9e-27
Identities = 65/65 (100%)
Strand = Plus / Minus

Query: 555   aggtcctgtggccaccgcagccaggacagagggaggagtgtgttcgaaagggaagagatc 614
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 51063 aggtcctgtggccaccgcagccaggacagagggaggagtgtgttcgaaagggaagagatc 51004

Query: 615   ctttg 619
             |||||
Sbjct: 51003 ctttg 50999



Score = 87.8 bits (44), Expect = 3e-14
Identities = 44/44 (100%)
Strand = Plus / Minus

Query: 1746  aggtgccaacacctatcaccgaaatggagatctctgtcaaaagg 1789
             ||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 47403 aggtgccaacacctatcaccgaaatggagatctctgtcaaaagg 47360

>AC000063.1.1.34478

Length = 34478

Score = 71.9 bits (36), Expect = 2e-09
Identities = 48/52 (92%)
Strand = Plus / Minus

Query: 1860 ctgcctgtgcagagtgctgcctggcccgggacccatactgtgcctgggatgg 1911
            |||||||||| || |||||||| ||||||||||| |||||||||||||||||
Sbjct: 5711 ctgcctgtgctgactgctgccttgcccgggacccttactgtgcctgggatgg 5660

>AC079799.7.1.172495

Length = 172495

Score = 54.0 bits (27), Expect = 5e-04
Identities = 42/47 (89%)
Strand = Plus / Minus

Query: 1865   tgtgcagagtgctgcctggcccgggacccatactgtgcctgggatgg 1911
              ||||| || ||||||||||| || ||||| |||||||||||||||||
Sbjct: 151014 tgtgctgactgctgcctggctcgagacccttactgtgcctgggatgg 150968



Database: Homo_sapiens.latestgp.fa

Posted date: Jul 8, 2003 12:51 PM
Number of letters in database: 200,800,637,119
Number of sequences in database: 26,679

Lambda     K       H
  1.37    0.711   1.31

Gapped
Lambda     K       H
  1.37    0.711   1.31

Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 0, Extension: 0
Number of Hits to DB: 0
length of query: 5260
length of database: 200,800,637,119
effective HSP length: 22
effective length of query: 2607
effective search space used: 0
T: 0
A: 0
X1: 0 ( 0.0 bits)
X2: 20 (39.7 bits)
S1: 12 (24.3 bits)
S2: 24 (48.1 bits)